Trilinos Package Summary

Page 1:

Trilinos Package Summary (Objective: Package(s))

Discretizations
  Meshing & Spatial Discretizations: phdMesh, Intrepid, Pamgen
  Time Integration: Rythmos

Methods
  Automatic Differentiation: Sacado
  Mortar Methods: Moertel

Core
  Linear algebra objects: Epetra, Jpetra, Tpetra
  Abstract interfaces: Thyra, Stratimikos, RTOp
  Load Balancing: Zoltan, Isorropia
  “Skins”: PyTrilinos, WebTrilinos, Star-P, ForTrilinos, CTrilinos
  C++ utilities, I/O, thread API: Teuchos, EpetraExt, Kokkos, Triutils, TPI

Solvers
  Iterative (Krylov) linear solvers: AztecOO, Belos, Komplex
  Direct sparse linear solvers: Amesos
  Direct dense linear solvers: Epetra, Teuchos, Pliris
  Iterative eigenvalue solvers: Anasazi
  ILU-type preconditioners: AztecOO, IFPACK
  Multilevel preconditioners: ML, CLAPS
  Block preconditioners: Meros
  Nonlinear system solvers: NOX, LOCA
  Optimization (SAND): MOOCHO, Aristos
  Stochastic PDEs: Stokhos

Page 2:

Changing Scope of Trilinos

Capabilities: Past: solver capabilities and supporting components. Now: any library for science/engineering (Zoltan, Intrepid, …).

Customers: Past: Sandia and other NNSA customers. Now: expanding to Office of Science applications, DoD, DHS, CRADAs, and WFO.

Platforms: Past: all platforms, using a command-line installer (Autotools) with a Linux/Unix bias. Now: expanding to a GUI and binary installer (CMake) and a native Windows/Mac process.

The Changing Scope of the Trilinos Project, Michael A. Heroux, Technical Report, Sandia National Laboratories, SAND2007-7775, December 2007.

Page 3:

Capability Leaders: New Layer of Proactive Leadership

Areas:
• Framework, Tools & Interfaces (J. Willenbring)
• Discretizations (P. Bochev)
• Geometry, Meshing & Load Balancing (K. Devine)
• Scalable Linear Algebra (M. Heroux)
• Linear & Eigen Solvers (J. Hu)
• Nonlinear, Transient & Optimization Solvers (A. Salinger)

Each leader provides strategic direction across all Trilinos packages within their area.

Page 4:

A Few HPCCG Multicore Results

Float is useful: it enables mixed-precision algorithms (see the sketch below).

Bandwidth is even more important: once it saturates, additional cores are effectively lost.

Memory placement is a concern: shared memory allows data to be placed on a remote socket.

Niagara2 (T2) threads hide latency: it is the easiest node to program.
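To make the mixed-precision point concrete, here is a minimal, self-contained sketch of the general idea (iterative refinement), not the actual HPCCG/pHPCCG code: do the bulk of the arithmetic in float and accumulate residuals and corrections in double.

// Minimal mixed-precision iterative refinement sketch (illustration only,
// not the HPCCG/pHPCCG code): solve in float, refine in double.
#include <cmath>
#include <cstdio>

// Solve the 2x2 system A*x = b in single precision (Cramer's rule).
void solve_float(const float A[2][2], const float b[2], float x[2]) {
  float det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
  x[0] = (b[0] * A[1][1] - b[1] * A[0][1]) / det;
  x[1] = (A[0][0] * b[1] - A[1][0] * b[0]) / det;
}

int main() {
  const double A[2][2] = {{4.0, 1.0}, {1.0, 3.0}};
  const double b[2] = {1.0, 2.0};
  const float Af[2][2] = {{4.0f, 1.0f}, {1.0f, 3.0f}};

  // Initial solve entirely in float.
  float bf[2] = {(float)b[0], (float)b[1]}, xf[2];
  solve_float(Af, bf, xf);
  double x[2] = {xf[0], xf[1]};

  // Refinement loop: double-precision residual, cheap float correction solve.
  for (int it = 0; it < 3; ++it) {
    double r[2];
    for (int i = 0; i < 2; ++i)
      r[i] = b[i] - (A[i][0] * x[0] + A[i][1] * x[1]);
    float rf[2] = {(float)r[0], (float)r[1]}, cf[2];
    solve_float(Af, rf, cf);
    x[0] += cf[0];   // corrections accumulated in double
    x[1] += cf[1];
    std::printf("refinement %d: residual norm %.3e\n", it,
                std::sqrt(r[0] * r[0] + r[1] * r[1]));
  }
  return 0;
}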

Page 5:

More Float vs. Double: Barcelona pHPCCG

Float is faster than double, and float scales better.

Page 6:

Multi-Programming Model Runtime Environment: Niagara2

MPI & MPI+threads:

App:
• Scales (superlinearly).
• MPI-only is sufficient.

Solver:
• Bandwidth-limited.
• MPI+threads can help.

Page 7:

Library Preparations for New Node Architectures (Decision Made Years Ago)

We knew node architectures would change…

Abstract Parallel Machine Interface: Comm class.

Abstract Linear Algebra Objects:
• Operator class: action of the operator only, no knowledge of how it is applied.
• RowMatrix class: serves up a row of coefficients on demand.
• Pure abstract layer: no unnecessary constraints at all.

Model Evaluator: highly flexible API for linear/nonlinear solver services.

Templated scalar and integer types: compile-time resolution of float, double, quad, …, int, long long, …; enables mixed-precision algorithms.
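A minimal sketch of what such an abstract layer looks like; the class shapes below are illustrative only (hypothetical names, not the actual Epetra/Tpetra declarations): a solver sees only the action y = A*x, and a row matrix additionally serves up one row of coefficients on demand.

// Illustrative abstraction sketch (hypothetical names, not the real
// Epetra/Tpetra interfaces): solvers depend only on the action of an operator.
#include <vector>

struct Vector { std::vector<double> values; };

// Action of the operator only; no knowledge of how it is implemented.
class Operator {
 public:
  virtual ~Operator() = default;
  virtual void apply(const Vector& x, Vector& y) const = 0;
};

// A row-oriented matrix additionally serves up a row of coefficients on demand.
class RowMatrix : public Operator {
 public:
  virtual void getRow(int row, std::vector<int>& cols,
                      std::vector<double>& vals) const = 0;
};

// Any code written against Operator (e.g. a Krylov iteration) works for a
// stored matrix, a matrix-free operator, or a preconditioned operator alike.
void applyTwice(const Operator& A, const Vector& x, Vector& y) {
  Vector t{std::vector<double>(x.values.size(), 0.0)};
  y.values.assign(x.values.size(), 0.0);
  A.apply(x, t);
  A.apply(t, y);
}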

Page 8:

Library Effort in Response to Node Architecture Trends

Block Krylov Methods (Belos & Anasazi):
• Natural for UQ, QMU, sensitivity analysis, …
• Superior node and network complexity.

Templated Kernel Libraries (Tpetra & Tifpack):
• Choice of float vs. double made when the object is created (see the sketch after this list).
• High-performance multiprecision algorithms.

Threaded Comm Class (Tpetra):
• Intel TBB support, compatible with OpenMP, Pthreads, …
• Clients of Tpetra::TbbMpiComm can access a static, ready-to-work thread pool.
• Code above the basic kernel level is unaware of threads.

Specialized sparse matrix data structures: sparse diagonal, sparse-dense, composite.
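As a sketch of the "float vs. double chosen when the object is created" idea (the class below is a hypothetical stand-in, not the Tpetra API): the scalar type is a template parameter resolved at compile time, so one kernel implementation serves every precision.

// Sketch only: compile-time scalar resolution in the spirit of the templated
// kernel libraries; SimpleVector is illustrative, not Tpetra.
#include <cstddef>
#include <vector>

template <typename Scalar>
class SimpleVector {
 public:
  explicit SimpleVector(std::size_t n, Scalar value = Scalar(0))
      : data_(n, value) {}

  // y <- a*x + y: the precision of this kernel is fixed when the object is created.
  void axpy(Scalar a, const SimpleVector& x) {
    for (std::size_t i = 0; i < data_.size(); ++i)
      data_[i] += a * x.data_[i];
  }

  Scalar dot(const SimpleVector& x) const {
    Scalar sum = Scalar(0);
    for (std::size_t i = 0; i < data_.size(); ++i)
      sum += data_[i] * x.data_[i];
    return sum;
  }

 private:
  std::vector<Scalar> data_;
};

int main() {
  SimpleVector<float>  xf(1000, 1.0f), yf(1000, 2.0f);  // single-precision objects
  SimpleVector<double> xd(1000, 1.0),  yd(1000, 2.0);   // double-precision objects
  yf.axpy(0.5f, xf);
  yd.axpy(0.5, xd);
  return (yf.dot(xf) > 0.0f && yd.dot(xd) > 0.0) ? 0 : 1;
}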

MPI-only + MPI/PNAS:

The application runs MPI-only (8 flat MPI processes on a dual quad-core node). The solver runs:
• MPI-only when interfacing with the app using a partitioned nodal address space (PNAS).
• 2 MPI processes, 4 threads each, when solving the problem.
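One way such a hybrid layout could be set up, shown here only as a sketch under assumptions (the rank-selection rule and thread count are illustrative, not the actual runtime code): keep all 8 flat MPI ranks for the application, and give the solver a 2-rank sub-communicator with 4 OpenMP threads per solver rank.

// Hedged sketch of the MPI-only + MPI/threads split described above.
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Application phase: all 8 ranks participate, flat MPI on MPI_COMM_WORLD.
  // ... application code ...

  // Solver phase: keep one rank per socket (illustratively, ranks 0 and 4)
  // in a sub-communicator; the other ranks sit out of the solve.
  int color = (world_rank % 4 == 0) ? 1 : MPI_UNDEFINED;
  MPI_Comm solver_comm;
  MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &solver_comm);

  if (solver_comm != MPI_COMM_NULL) {
    omp_set_num_threads(4);  // 2 MPI processes x 4 threads each
    #pragma omp parallel
    {
      // ... threaded solver kernels ...
    }
    MPI_Comm_free(&solver_comm);
  }

  MPI_Finalize();
  return 0;
}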

Page 9:

What is BEC?

A programming model developed at Sandia, based on careful analysis of Sandia applications, the strengths and weaknesses of past and current programming models, and the technology evolution path.

Code example:

shared int A[10000], B[10000], C[10000];  /* globally shared data */

BEC_request(A[3]);
BEC_request(B[8]);       /* bundle requests */

BEC_exchange();          /* exchange bundled requests globally */

C[10] = A[3] + B[8];     /* computation using shared data like local data */

BEC Model: BEC combines the convenience of virtual shared memory (a.k.a. Global Address Space, GAS) with the efficiency of Bulk Synchronous Parallel (BSP); BEC has built-in capabilities for efficient support of high-volume, random, fine-grained communication (accesses to virtual shared memory).

Page 10:

BEC Application: HPCCG

HPCCG forms and solves a linear system using the Conjugate Gradient (CG) method. The MPI version is part of the Mantevo toolset. The BEC and MPI versions have two main steps:
• Bundle preparation (message queue setup, data bundling, organizing remotely fetched data, etc.).
• CG iterations (until convergence).

The UPC version is a benchmark from the UPC Consortium (“fully optimized”).

Lines of code:

Task                                               BEC   MPI   UPC
Bundle preparation                                   6   240   N/A
CG iterations (computation-related code)            60    87   N/A
Communication-related code                          11   277   N/A
Whole program (excluding empty lines, comments)    233   733   about 900

Page 11:

Iteration in BEC (many phases of Bundle-Exchange-Compute)

// Note: Since the requests for the shared data are always the same for this application,
// there is no need to make explicit BEC_request() calls again. Just call BEC_repeat_requests().
for (int iter = 1; iter < max_iter; iter++) {
    BEC_repeat_requests(bundle);
    BEC_exchange();

    // Compute -----------------------------------------------------
    mv(A, p, bundle, Ap);          /* BEC and MPI versions of mv() are similar */
    ddot(p, Ap, &alpha);
    alpha = rtrans / alpha;
    waxpby(1.0, x, alpha, p, x);   /* BEC and MPI versions of waxpby() are similar */
    waxpby(1.0, r, -alpha, Ap, r);
    oldrtrans = rtrans;
    ddot(r, r, &rtrans);
    normr = sqrt(rtrans);
    if (normr <= tolerance) break; // converged
    beta = rtrans / oldrtrans;
    waxpby(1.0, r, beta, p, p);
}

Page 12:

BEC Application: Graph Coloring

Vertex coloring algorithm (heuristic): Largest Degree First (a minimal sketch follows the table below).

Lines of code:

Task                                               BEC   MPI
Communication-related code (including bundling)     10    69
Computation                                         58    61
Whole program (excluding empty lines, comments)    131   201
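For reference, here is a minimal sequential sketch of the Largest Degree First heuristic named above (an illustration of the algorithm only, not the parallel BEC or MPI implementations): visit vertices in order of decreasing degree and give each one the smallest color not used by an already-colored neighbor.

// Sequential Largest Degree First greedy coloring (sketch of the heuristic).
#include <algorithm>
#include <numeric>
#include <vector>

std::vector<int> largestDegreeFirstColoring(
    const std::vector<std::vector<int>>& adj) {
  const int n = static_cast<int>(adj.size());
  std::vector<int> order(n), color(n, -1);
  std::iota(order.begin(), order.end(), 0);

  // Visit vertices in order of decreasing degree.
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    return adj[a].size() > adj[b].size();
  });

  for (int v : order) {
    // Mark the colors already used by colored neighbors of v.
    std::vector<bool> used(n, false);
    for (int u : adj[v])
      if (color[u] >= 0) used[color[u]] = true;
    // Assign the smallest unused color.
    int c = 0;
    while (used[c]) ++c;
    color[v] = c;
  }
  return color;
}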

Page 13:

Graph Coloring Performance Comparison (on “Franklin”, Cray XT3 at NERSC)

Page 14:

Using BEC (1)

The BEC Beta Release is available for download (http://www.cs.sandia.gov/BEC). It is portable to machines (including PCs) with:
• Unix/Linux, a C++ compiler, and a message-passing library (e.g. MPI, Portals).

BEC can be used alone or mixed with MPI (e.g. in the same function).

BEC includes:
• An extension to ANSI C to allow “shared” variables (array syntax).
• A runtime library, BEC lib.

Use BEC with the language extension:
• BEC code (C code) + BEC lib function calls.

Use BEC lib directly:
• C or Fortran code + BEC lib function calls; “A[10] = …” is written as “BEC_write(A, 10, …)”.

Page 15:

Using BEC (2): BEC Language Extension vs. BEC Lib

BEC language extension:

// Declaration
typedef struct { int x; int y; } my_type;
shared my_type my_partition A[n];
shared double B[m][n];

// Shared data requests
BEC_request(A[3].x);
BEC_request(B[3][4]);

// Exchange of shared data
BEC_exchange();

// Computation
S = B[3][4] + A[3].x;

BEC lib calls only:

// Declaration
typedef struct { int x; int y; } my_type;
BEC_shared_1d(my_type, my_partition) A;
BEC_shared_2d(double, equal) B;

// Shared space allocation
BEC_joint_allocate(A, n);
BEC_joint_allocate(B, m, n);

// Shared data requests
BEC_request(BEC_element_attribute(A, x), 3);
BEC_request(B, 3, 4);

// Global exchange of shared data
BEC_exchange();

// Computation
S = BEC_read(B, 3, 4) + BEC_read(BEC_element_attribute(A, x), 3);

Page 16:

BEC Lib Functions

Basic:
• BEC_initialize(); BEC_finalize();
• BEC_request();
• BEC_exchange();
• BEC_read(); BEC_write();

For dynamic shared space:
• BEC_joint_allocate(); BEC_joint_free();

For performance optimization:
• Communication & computation overlap: BEC_exchange_begin(); BEC_exchange_end(); (see the sketch after this list)
• Special requests: BEC_request_to(); BEC_apply_to();
• Reusable bundle object: BEC_create_persistent_bundle(); BEC_delete_bundle(); BEC_repeat_requests();
• Directly getting data out of a bundle: BEC_get_index_in_bundle(); BEC_increment_index_in_bundle(); BEC_get_value_in_bundle();
• Local portion of a shared array: BEC_local_element_count(); BEC_local_element();

Miscellaneous:
• BEC_get_global_index(); BEC_proc_count(); BEC_my_proc_id();
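As an example of how the overlap calls above might be combined (a sketch under an assumed split-phase semantics for BEC_exchange_begin()/BEC_exchange_end(); check the BEC release for the exact argument lists), communication can be hidden behind purely local work:

shared double A[100000];               /* globally shared data */

void overlapped_step(double* local_work, int local_n) {
  BEC_request(A[3]);                   /* bundle the remote accesses */
  BEC_request(A[42]);

  BEC_exchange_begin();                /* start the global exchange */

  /* purely local computation that does not touch A[3] or A[42] */
  for (int i = 0; i < local_n; i++)
    local_work[i] *= 2.0;

  BEC_exchange_end();                  /* wait for the bundled data */

  double s = A[3] + A[42];             /* now safe to use the shared data */
  (void)s;
}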

Page 17:

Some Future Directions

Increased compatibility with MPI: light-weight “translation” of BEC global arrays into plain MPI local arrays and vice versa; BEC embeddable into an existing MPI app (already possible in most instances).

Hybrid BEC+MPI HPCCG: setup/exchange done with BEC; key kernels performed using plain local arrays; vector ops using either BEC or MPI collectives (a sketch follows this list).

Control structures: parallel loops.

Virtualized processors: more virtual processors than the MPI size.

Bundle as a full first-class object: already mostly there.

Simultaneous exchange_begin()'s.

Queuing, pipelining of iterations.
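A hedged sketch of the hybrid BEC+MPI direction above (the local array names are illustrative, and the BEC side would follow the BEC_request()/BEC_exchange() pattern shown on page 11; MPI_Allreduce is standard MPI): the matrix-vector setup and exchange go through BEC, while a vector reduction uses a plain MPI collective on local arrays.

#include <mpi.h>

// Dot-product kernel performed on plain local arrays and reduced with an MPI
// collective; the remote vector entries needed by the matrix-vector product
// would be fetched beforehand with a bundled BEC exchange.
double hybrid_dot(const double* local_x, const double* local_y, int local_n) {
  double local_sum = 0.0, global_sum = 0.0;
  for (int i = 0; i < local_n; i++)
    local_sum += local_x[i] * local_y[i];
  MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return global_sum;
}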