Trilinos Package Summary
| Objective | Package(s) |
| --- | --- |
| **Discretizations** | |
| Meshing & spatial discretizations | phdMesh, Intrepid, Pamgen |
| Time integration | Rythmos |
| **Methods** | |
| Automatic differentiation | Sacado |
| Mortar methods | Moertel |
| **Core** | |
| Linear algebra objects | Epetra, Jpetra, Tpetra |
| Abstract interfaces | Thyra, Stratimikos, RTOp |
| Load balancing | Zoltan, Isorropia |
| “Skins” | PyTrilinos, WebTrilinos, Star-P, ForTrilinos, CTrilinos |
| C++ utilities, I/O, thread API | Teuchos, EpetraExt, Kokkos, Triutils, TPI |
| **Solvers** | |
| Iterative (Krylov) linear solvers | AztecOO, Belos, Komplex |
| Direct sparse linear solvers | Amesos |
| Direct dense linear solvers | Epetra, Teuchos, Pliris |
| Iterative eigenvalue solvers | Anasazi |
| ILU-type preconditioners | AztecOO, IFPACK |
| Multilevel preconditioners | ML, CLAPS |
| Block preconditioners | Meros |
| Nonlinear system solvers | NOX, LOCA |
| Optimization (SAND) | MOOCHO, Aristos |
| Stochastic PDEs | Stokhos |
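To give a feel for the Core linear algebra packages, here is a minimal Epetra sketch (an illustration, not taken from the slides): it builds a distributed map, fills two vectors, and computes a global dot product.

```cpp
#include <iostream>
#include <mpi.h>
#include "Epetra_MpiComm.h"
#include "Epetra_Map.h"
#include "Epetra_Vector.h"

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  Epetra_MpiComm comm(MPI_COMM_WORLD);

  // 100 global entries, index base 0, distributed across all ranks.
  Epetra_Map map(100, 0, comm);

  Epetra_Vector x(map), y(map);
  x.PutScalar(2.0);
  y.PutScalar(3.0);

  double dot = 0.0;
  x.Dot(y, &dot);  // global reduction: 100 * 2 * 3 = 600

  if (comm.MyPID() == 0)
    std::cout << "x . y = " << dot << std::endl;

  MPI_Finalize();
  return 0;
}
```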
The Changing Scope of Trilinos

- Capabilities: Past: solver capabilities and supporting components. Now: any library for science/engineering (Zoltan, Intrepid, …).
- Customers: Past: Sandia and other NNSA customers. Now: expanding to Office of Science applications, DoD, DHS, CRADAs, and WFO.
- Platforms: Past: all platforms, via a command-line installer (Autotools) with a Linux/Unix bias. Now: expanding to a GUI and binary installer (CMake), with a native Windows/Mac process.
Michael A. Heroux, *The Changing Scope of the Trilinos Project*, Technical Report SAND2007-7775, Sandia National Laboratories, December 2007.
Capability Leaders: A New Layer of Proactive Leadership

Areas:
- Framework, Tools & Interfaces (J. Willenbring)
- Discretizations (P. Bochev)
- Geometry, Meshing & Load Balancing (K. Devine)
- Scalable Linear Algebra (M. Heroux)
- Linear & Eigen Solvers (J. Hu)
- Nonlinear, Transient & Optimization Solvers (A. Salinger)

Each leader provides strategic direction across all Trilinos packages within their area.
A Few HPCCG Multicore Results

- Float is useful: it enables mixed-precision algorithms.
- Bandwidth is even more important: saturation means loss of cores.
- Memory placement is a concern: shared memory allows remote placement.
- Niagara2 threads hide latency: the easiest node to program.
More Float vs. Double: Barcelona pHPCCG

- Float is faster than double, and float scales better.
Multi-Programming-Model Runtime Environment: Niagara2 (MPI & MPI+threads)

- Application: scales (superlinearly); MPI-only is sufficient.
- Solver: bandwidth-limited; MPI+threads can help.
Library Preparations for New Node Architectures (Decisions Made Years Ago)

We knew node architectures would change, so the libraries were built on:

- An abstract parallel machine interface: the Comm class.
- Abstract linear algebra objects:
  - Operator class: exposes only the action of the operator, with no knowledge of how it is applied.
  - RowMatrix class: serves up a row of coefficients on demand.
  - Pure abstract layer: no unnecessary constraints at all.
- Model Evaluator: a highly flexible API for linear/nonlinear solver services.
- Templated scalar and integer types: compile-time resolution of float, double, quad, … and int, long long, …; enables mixed-precision algorithms (see the sketch below).
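A minimal sketch of what compile-time scalar templating buys (illustrative only, not Trilinos source): one kernel serves any scalar/ordinal combination, and a mixed-precision variant stores in float while accumulating in double.

```cpp
// Illustrative only: the scalar and ordinal types are resolved at compile
// time, so the same kernel serves float, double, quad, int, long long, ...
template <typename Scalar, typename Ordinal>
Scalar dot(const Scalar* x, const Scalar* y, Ordinal n) {
  Scalar sum = 0;
  for (Ordinal i = 0; i < n; ++i)
    sum += x[i] * y[i];
  return sum;
}

// Mixed precision: vectors stored in float, accumulation done in double.
template <typename Ordinal>
double dot_mixed(const float* x, const float* y, Ordinal n) {
  double sum = 0.0;
  for (Ordinal i = 0; i < n; ++i)
    sum += static_cast<double>(x[i]) * static_cast<double>(y[i]);
  return sum;
}

// Usage: dot<float, int>(xf, yf, n); dot<double, long long>(xd, yd, n);
```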
Library Effort in Response to Node Architecture Trends

- Block Krylov methods (Belos & Anasazi): natural for UQ, QMU, sensitivity analysis, …; superior node and network complexity.
- Templated kernel libraries (Tpetra & Tifpack): the float-vs-double choice is made when an object is created; enables high-performance multiprecision algorithms.
- Threaded Comm class (Tpetra): Intel TBB support, compatible with OpenMP, Pthreads, …. Clients of Tpetra::TbbMpiComm can access a static, ready-to-work thread pool; code above the basic kernel level is unaware of threads.
- Specialized sparse matrix data structures: sparse diagonal, sparse-dense, composite.
- MPI-only + MPI/PNAS: the application runs MPI-only (8 flat MPI processes on a dual quad-core node). The solver runs MPI-only when interfacing with the application through a partitioned nodal address space (PNAS), and as 2 MPI processes with 4 threads each when solving the problem (see the sketch below).
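A rough illustration of the "MPI-only app, MPI+threads solver" split, using generic MPI/OpenMP rather than the Tpetra::TbbMpiComm machinery described above; run e.g. as 2 MPI processes with OMP_NUM_THREADS=4.

```cpp
#include <cstdio>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // App phase: flat MPI, one control thread per process.
  // ... assemble the local portion of the problem ...

  // Solver phase: threads fan out within each process's address space.
  #pragma omp parallel
  {
    std::printf("rank %d: thread %d of %d working\n",
                rank, omp_get_thread_num(), omp_get_num_threads());
    // ... threaded kernel work, e.g. a local sparse mat-vec ...
  }

  MPI_Finalize();
  return 0;
}
```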
What is BEC?
A programming model developed at Sandia, based on careful analysis of Sandia applications, the strengths and weaknesses of past and current programming models, and the technology evolution path.
Code example:
```c
shared int A[10000], B[10000], C[10000]; /* globally shared data */

BEC_request(A[3]);
BEC_request(B[8]);     /* bundle the requests */
BEC_exchange();        /* exchange bundled requests globally */
C[10] = A[3] + B[8];   /* compute using shared data as if it were local */
```
BEC Model: BEC combines the convenience of virtual shared memory (a.k.a. a Global Address Space, GAS) with the efficiency of Bulk Synchronous Parallel (BSP). BEC has built-in support for efficient, high-volume, random, fine-grained communication (accesses to virtual shared memory).
BEC Application: HPCCG
Form and solve a linear system using the Conjugate Gradient (CG) method. The MPI version is part of the Mantevo toolset. The BEC and MPI versions have two main steps: bundle preparation (message-queue setup, data bundling, organizing remotely fetched data, etc.) and CG iterations (run until convergence). The UPC version is a benchmark from the UPC Consortium (“fully optimized”).
Lines of code:

| Task | BEC | MPI | UPC |
| --- | --- | --- | --- |
| Bundle preparation | 6 | 240 | N/A |
| CG iterations (computation-related code) | 60 | 87 | N/A |
| Communication-related code | 11 | 277 | N/A |
| Whole program (excluding empty lines, comments) | 233 | 733 | ~900 |
Iteration in BEC (many phases of Bundle-Exchange-Compute)
```c
// Since the requests for the shared data are the same in every iteration of
// this application, there is no need to repeat explicit BEC_request() calls;
// just call BEC_repeat_requests().
for (int iter = 1; iter < max_iter; iter++) {
  BEC_repeat_requests(bundle);
  BEC_exchange();

  // Compute -----------------------------------------------------
  mv(A, p, bundle, Ap);           /* BEC and MPI versions of mv() are similar */
  ddot(p, Ap, &alpha);
  alpha = rtrans / alpha;
  waxpby(1.0, x, alpha, p, x);    /* BEC and MPI versions of waxpby() are similar */
  waxpby(1.0, r, -alpha, Ap, r);
  oldrtrans = rtrans;
  ddot(r, r, &rtrans);
  normr = sqrt(rtrans);
  if (normr <= tolerance) break;  // converged
  beta = rtrans / oldrtrans;
  waxpby(1.0, r, beta, p, p);
}
```
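For context, a sketch of the one-time setup such a loop presumes. The bundle type and the argument lists below are assumptions based on the function names listed on the BEC Lib slide, not verified signatures.

```c
/* Assumed setup sketch: BEC_bundle, num_remote, and remote_index are
 * hypothetical; only the function names come from the BEC Lib slide. */
BEC_bundle bundle = BEC_create_persistent_bundle();
for (int i = 0; i < num_remote; i++)
  BEC_request(p[remote_index[i]]);  /* off-process entries needed by mv() */
BEC_exchange();                     /* first global exchange fills the bundle */
/* ... CG loop: BEC_repeat_requests(bundle) reuses these requests ... */
BEC_delete_bundle(bundle);
```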
BEC Application: Graph Coloring

Vertex coloring algorithm (heuristic): Largest Degree First (a sequential sketch follows the table).

Lines of code:

| Task | BEC | MPI |
| --- | --- | --- |
| Communication-related code (including bundling) | 10 | 69 |
| Computation | 58 | 61 |
| Whole program (excluding empty lines, comments) | 131 | 201 |
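For reference, a plain sequential sketch of the Largest Degree First heuristic (not the BEC or MPI implementation): visit vertices in order of decreasing degree and give each the smallest color unused by its neighbors.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Sequential Largest Degree First coloring sketch (illustrative only).
std::vector<int> largest_degree_first(const std::vector<std::vector<int>>& adj) {
  const int n = static_cast<int>(adj.size());

  // Order vertices by decreasing degree.
  std::vector<int> order(n);
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return adj[a].size() > adj[b].size(); });

  std::vector<int> color(n, -1);
  std::vector<char> used(n + 1, 0);  // scratch: colors taken by neighbors
  for (int v : order) {
    for (int u : adj[v]) if (color[u] >= 0) used[color[u]] = 1;
    int c = 0;
    while (used[c]) ++c;             // smallest free color
    color[v] = c;
    for (int u : adj[v]) if (color[u] >= 0) used[color[u]] = 0;
  }
  return color;
}
```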
Graph Coloring Performance Comparison (on “Franklin”, Cray XT3 at NERSC)
Using BEC (1)

A BEC beta release is available for download (http://www.cs.sandia.gov/BEC). It is portable to machines (including PCs) with Unix/Linux, a C++ compiler, and a message-passing library (e.g., MPI, Portals). BEC can be used alone or mixed with MPI (even in the same function).

BEC includes:
- An extension to ANSI C that allows “shared” variables (array syntax).
- A runtime library, BEC lib.

Two usage styles:
- With the language extension: BEC code (C code) + BEC lib function calls.
- With BEC lib directly: C or Fortran code + BEC lib function calls; e.g., “A[10] = …” is written as “BEC_write(A, 10, …)”.
Using BEC (2): BEC Language Extension vs. BEC Lib
Using the language extension:

```c
// Declaration
typedef struct { int x; int y; } my_type;
shared my_type my_partition A[n];
shared double B[m][n];

// Shared data requests
BEC_request(A[3].x);
BEC_request(B[3][4]);

// Exchange of shared data
BEC_exchange();

// Computation
S = B[3][4] + A[3].x;
```

Using BEC lib directly:

```c
// Declaration
typedef struct { int x; int y; } my_type;
BEC_shared_1d(my_type, my_partition) A;
BEC_shared_2d(double, equal) B;

// Shared space allocation
BEC_joint_allocate(A, n);
BEC_joint_allocate(B, m, n);

// Shared data requests
BEC_request(BEC_element_attribute(A, x), 3);
BEC_request(B, 3, 4);

// Global exchange of shared data
BEC_exchange();

// Computation
S = BEC_read(B, 3, 4) + BEC_read(BEC_element_attribute(A, x), 3);
```
BEC Lib Functions

Basic:
- BEC_initialize(); BEC_finalize();
- BEC_request();
- BEC_exchange();
- BEC_read(); BEC_write();

For dynamic shared space:
- BEC_joint_allocate(); BEC_joint_free();

For performance optimization:
- Communication & computation overlap: BEC_exchange_begin(); BEC_exchange_end();
- Special requests: BEC_request_to(); BEC_apply_to();
- Reusable bundle object: BEC_create_persistent_bundle(); BEC_delete_bundle(); BEC_repeat_requests();
- Directly getting data out of a bundle: BEC_get_index_in_bundle(); BEC_increment_index_in_bundle(); BEC_get_value_in_bundle();
- Local portion of a shared array: BEC_local_element_count(); BEC_local_element();

Miscellaneous:
- BEC_get_global_index(); BEC_proc_count(); BEC_my_proc_id();
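A sketch of how the overlap pair might be used; the calls are named on the slide, but the usage pattern is an assumption and do_local_work() is a hypothetical helper.

```c
/* Assumed overlap pattern; A, B, C are the shared arrays declared in the
 * earlier example, and do_local_work() is a hypothetical helper. */
BEC_request(A[3]);
BEC_request(B[8]);
BEC_exchange_begin();   /* start the global exchange in the background */
do_local_work();        /* overlap: compute on purely local data */
BEC_exchange_end();     /* wait until the requested shared data arrives */
C[10] = A[3] + B[8];    /* now safe to read the remote values */
```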
Some Future Directions

- Increased compatibility with MPI: lightweight “translation” of BEC global arrays into plain MPI local arrays and vice versa; BEC embeddable into an existing MPI app (already possible in most instances).
- Hybrid BEC+MPI HPCCG: setup/exchange done with BEC; key kernels performed on plain local arrays; vector ops using either BEC or MPI collectives.
- Control structures: parallel loops.
- Virtualized processors: more processes than the MPI size.
- Bundle as a full first-class object: already mostly there.
- Simultaneous exchange_begin() calls.
- Queuing and pipelining of iterations.