The Trilinos Project Exascale Roadmap · The Trilinos Project Exascale Roadmap Michael A. Heroux...
Transcript of The Trilinos Project Exascale Roadmap · The Trilinos Project Exascale Roadmap Michael A. Heroux...
1
The Trilinos Project ExascaleRoadmap
Michael A. HerouxSandia National Laboratories
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
Contributors: Mark Hoemmen, Siva Rajamanickam, Tobias Wiesner,Alicia Klinvex, Heidi Thornquist,
Kate Evans, Andrey Prokopenko, Lois McInnes
trilinos.github.io
Outline
§ Challenges.§ BriefOverviewofTrilinos.§ On-nodeparallelism.§ ParallelAlgorithms.§ TrilinosProductsOrganization.§ ForTrilinos.§ ThexSDK.
2
Challenges
§ On-nodeconcurrencyexpression:§ Vectorization/SIMT.§ On-nodetasking.§ Portability.
§ Parallelalgorithms:§ Vector/SIMTexpressible.§ Latencytolerant.§ Highlyscalable.
§ Multi-scaleandmulti-physics.§ Preconditioning.§ Softwarecomposition.
§ Resilience(DiscussedtomorrowmorningMS40).
3
4
Trilinos Overview
What is Trilinos?§ Object-oriented software framework for…§ Solving big complex science & engineering problems.§ Large collection of reusable scientific capabilities.§ More like LEGO™ bricks than Matlab™.
5
OptimalKernelstoOptimalSolutions:w Geometry,Meshingw Discretizations,LoadBalancing.w ScalableLinear,Nonlinear,Eigen,
Transient,Optimization,UQsolvers.w ScalableI/O,GPU,Manycore
w 60+Packages.w Otherdistributions:
w CrayLIBSCI.w GitHubrepo.
w ThousandsofUsers.w Worldwidedistribution.
LaptopstoLeadershipsystems
Trilinos Package SummaryObjective Package(s)
DiscretizationsMeshing & Discretizations STK, Intrepid, Pamgen, Sundance, ITAPS, Mesquite
Time Integration Rythmos
MethodsAutomatic Differentiation Sacado
Mortar Methods Moertel
Services
Linear algebra objects Epetra, Tpetra, Kokkos, Xpetra
Interfaces Thyra, Stratimikos, RTOp, FEI, Shards
Load Balancing Zoltan, Isorropia, Zoltan2“Skins” PyTrilinos, WebTrilinos, ForTrilinos, Ctrilinos, Optika
C++ utilities, I/O, thread API Teuchos, EpetraExt, Kokkos, Triutils, ThreadPool, Phalanx, Trios
Solvers
Iterative linear solvers AztecOO, Belos, Komplex
Direct sparse linear solvers Amesos, Amesos2, ShyLUDirect dense linear solvers Epetra, Teuchos, Pliris
Iterative eigenvalue solvers Anasazi, Rbgen
ILU-type preconditioners AztecOO, IFPACK, Ifpack2, ShyLU
Multilevel preconditioners ML, CLAPS, Muelu
Block preconditioners Meros, Teko
Nonlinear system solvers NOX, LOCA, Piro
Optimization (SAND) MOOCHO, Aristos, TriKota, Globipack, Optipack
Stochastic PDEs Stokhos
Unique features of Trilinos
8
§ Huge library of algorithms§ Linear and nonlinear solvers, preconditioners, …§ Optimization, transients, sensitivities, uncertainty, …
§ Growing support for multicore & hybrid CPU/GPU§ Built into the new Tpetra linear algebra objects
§ Therefore into iterative solvers with zero effort!§ Unified intranode programming model: Kokkos§ Spreading into the whole stack:
§ Multigrid, sparse factorizations, element assembly…
§ Support for mixed and arbitrary precisions§ Don’t have to rebuild Trilinos to use it
§ Support for flexible 2D sparse partitioning§ Useful for graph analytics, other data science apps.
§ Support for huge (> 2B unknowns) problems
AnasaziSnapshot
§ Abbreviationskey§ “Gen.”=generalizedeigenvaluesystem.§ “Prec:”=canuseapreconditioner.§ BKS=BlockKrylov Schur.§ LOBPCG=LocallyOptimial BlockPreconditionedConjugateGradient.§ RTR=RiemannianTrust-Regionmethod.
§ ^ denotesthatthesefeaturesmaybeimplementedifthereissufficientinterest.§ $ denotesthattheTraceMin familyofsolversiscurrentlyexperimental. 9
Core team:- Alicia Klinvex- Rich Lehoucq- Heidi Thornquist
Works with:- Epetra- Tpetra- Custom
https://trilinos.github.io/anasazi.html
Trilinoslinearsolvers§ Sparselinearalgebra
(Kokkos/KokkosKernels/Tpetra)§ Threadedconstruction,Sparsegraphs,(block)
sparsematrices,densevectors,parallelsolvekernels,parallelcommunication&redistribution
§ Iterative(Krylov)solvers (Belos)§ CG,GMRES,TFQMR,recyclingmethods
10
§ Sparsedirectsolvers (Amesos2)§ Algebraiciterativemethods (Ifpack2)
§ Jacobi,SOR,polynomial,incompletefactorizations,additiveSchwarz
§ Shared-memoryfactorizations (ShyLU)§ LU,ILU(k),ILUt,IC(k),iterativeILU(k)§ Direct+iterative preconditioners
§ Segregated blocksolvers(Teko)§ Algebraicmultigrid (MueLu)
KokkosKernels
11
On-node data & execution
Mustsupport>3architectures§ Comingsystemstosupport
§ Trinity (IntelHaswell &KNL)§ Sierra:NVIDIAGPUs+IBM
multicoreCPUs§ Plus“everythingelse”
§ 3differentarchitectures§ MulticoreCPUs(bigcores)§ Manycore CPUs(smallcores)§ GPUs(highlyparallel)
§ MPIonly,&MPI+threads§ Threadsdon’talwayspayon
non-GPUarchitecturestoday§ Portingtothreadsmustnot
slowdowntheMPI-onlycase12
Kokkos:Performance,Portability,&Productivity
13
DDR#
HBM#
DDR#
HBM#
DDR#DDR#
DDR#
HBM#HBM#
Kokkos#
LAMMPS# Sierra# Albany#Trilinos#
14
Kokkos ProgrammingModel
• Machinemodel:• N execution space + M memory spaces • NxM matrix for memory access performance/possibility• Asynchronous execution allowed
• Implementationapproach• A C++ template library• C++11 now required• Target different back-ends for different hardware architecture• Abstract hardware details and execution mapping details away
• Distribution• Open Source library• Soon (i.e. in the next few weeks) available on GitHub
• LongTermVision• Move features into the C++ standard (Carter Edwards voting committee member)
Goal:OneCodegivesgoodperformanceoneveryplatform
15
AbstractionConcepts
Execution Pattern: parallel_for, parallel_reduce, parallel_scan, task, …
Execution Policy : how (and where) a user function is executedE.g., data parallel range : concurrently call function(i) for i = [0..N)User’s function is a C++ functor or C++11 lambda
Execution Space : where functions executeEncapsulates hardware resources; e.g., cores, GPU, vector units, ...
Memory Space : where data residesØ AND what execution space can access that dataAlso differentiated by access performance; e.g., latency & bandwidth
Memory Layout : how data structures are ordered in memoryØ provide mapping from logical to physical index space
Memory Traits : how data shall be accessedØ allow specialisation for different usage scenarios (read only, random, atomic, …)
16
ExecutionPattern#include <Kokkos_Core.hpp>#include <cstdio>
int main(int argc, char* argv[]) {// Initialize Kokkos analogous to MPI_Init()// Takes arguments which set hardware resources (number of threads, GPU Id)Kokkos::initialize(argc, argv);
// A parallel_for executes the body in parallel over the index space, here a simple range 0<=i<10// It takes an execution policy (here an implicit range as an int) and a functor or lambda// The lambda operator has one argument, and index_type (here a simple int for a range)Kokkos::parallel_for(10,[=](int i){
printf(”Hello %i\n",i); });
// A parallel_reduce executes the body in parallel over the index space, here a simple range 0<=i<10 and // performs a reduction over the values given to the second argument // It takes an execution policy (here an implicit range as an int); a functor or lambda; and a return valuedouble sum = 0;Kokkos::parallel_reduce(10,[=](int i, int& lsum) {
lsum += i; },sum);printf("Result %lf\n",sum);
// A parallel_scan executes the body in parallel over the index space, here a simple range 0<=i<10 and // Performs a scan operation over the values given to the second argument // If final == true lsum contains the prefix sum. double sum = 0;Kokkos::parallel_scan(10,[=](int i, int& lsum, bool final) {
if(final) printf(”ScanValue %i\n",lsum); lsum += i;
});
Kokkos::finalize();}
Kokkos protectsusagainst…§ Hardwaredivergence§ Programmingmodeldiversity§ Threadsatall
§ Kokkos::Serial back-end§ Kokkos’semanticsrequire
vectorizable (ivdep)loops§ Exposeparallelismtoexploitlater§ Hierarchicalparallelismmodel
encouragesexploitinglocality
§ Kokkos protectsourHUGEtimeinvestmentofportingTrilinos
§ Note:Kokkos isnotmagic,cannotmakebadalgorithmsscale.
17
Kokkos isourhedge
Parallel Algorithms
FoundationalTechnology:KokkosKernels§ ProvideBLAS(1,2,3);Sparse;GraphandTensorKernels§ Kokkos based:PerformancePortable§ Interfacestovendorlibrariesifapplicable(MKL,CuSparse,…)§ Goal:Providekernelsforalllevelsofnodehierarchy
19
VectorLane§ ElementalFunctions,Serial,e.g.3x3DGEMM
HyperThread§ VectorParallelism,Synchfree,e.g.MatrixRowx Vector
Core§ ThreadParallelism,SharedL1/L2,e.g.SubdomainSolve
Socket§ ThreadTeams,SharedL3,e.g.FullSolve
Kokkoskernels:SparseMatrix-MatrixMultiplication(SpGEMM)
• SpGEMM is the most expensive part of the multigrid setup.
• New portable write-avoiding algorithm in KokkosKernels is~14x faster than NVIDIA’s CUSPARSE on K80 GPUs.
• ~2.8x faster than Intel’s MKL on Intel’s Knights Landing (KNL).
• Memory scalable: Solving larger problems that cannot be solved by codes like NVIDIA’s CUSP and Intel’s MKL when using large number of threads.
• Up is good.
1.22x%
2.55x%
0.00%
1.00%
2.00%
3.00%
4.00%
5.00%
6.00%
7.00%
1% 2% 4% 8% 16% 32% 64% 128% 256% 1% 2% 4% 8% 16% 32% 64% 128% 256%
NoReuse% Reuse%
Geo
metric
%mean%of%th
e%GFLOPs%fo
r%vario
us%M
ulDp
licaD
ons%%on%KN
L%%%
(Stron
g%Scaling)%
KokkosKernels%
MKL%
AxP$ RX(AP)$ AxP$ RX(AP)$ AxP$ RX(AP)$
Laplace$ Brick$ Empire$
cuSPARSE$ 0.100$ 0.229$ 0.291$ 0.542$ 0.646$ 0.715$
KokkosKernels$ 1.489$ 1.458$ 2.234$ 2.118$ 2.381$ 1.678$
14.89x$
0.00$
0.25$
0.50$
0.75$
1.00$
1.25$
1.50$
1.75$
2.00$
2.25$
2.50$
Sparse$M
atrixHM
atrix$m
ulIplicaIon$$
GFLOPs$on$K80$
cuSPARSE$
KokkosKernels$
• Goal: Identify independent data that can be processed in parallel.
• Performance: Better quality (4x on average) and run time (1.5x speedup ) w.r.t cuSPARSE.
• Enables parallelization of preconditioners: Gauss Seidel: 82x speedup on KNC, 136x on K20 GPUs
63.87&
1.22&
11.01&
1.75&
2.38&
0.66&
0.41&
0.19&
0.44&
0.64&
0.13&
0.25&
0.50&
1.00&
2.00&
4.00&
8.00&
16.00&
32.00&
64.00&
ci in kr li h a rg e B Q
Speedu
p&w.r.t.&cuSPAR
SE& KokkosKernels& cuSPARSE&
0.02$
0.97$
0.19$
0.59$
0.95$
0.32$
0.39$
0.19$
0.33$
0.34$
0.00$
0.20$
0.40$
0.60$
0.80$
1.00$
1.20$
cir in kr liv ho au rg eu Bu Q
Normalized
$#colors$w.r.t.$
cuSPAR
SE$
KokkosKernels$cuSPARSE$
Kokkoskernels:GraphColoringandSymmetricGauss-Seidel
Kokkos and Kokkos Kernels are available independent of Trilinos:https://github.com/kokkos
▪ MPI+Xbasedsubdomainsolvers▪ DecouplethenotionofoneMPIrankasonesubdomain:Subdomainscanspan
multipleMPIrankseachwithitsownsubdomainsolverusingXorMPI+X▪ Subpackages ofShyLU:MultipleKokkos-basedoptionsforon-nodeparallelism
▪ Basker :LUorILU(t)factorization▪ Tacho:IncompleteCholesky - IC(k)▪ Fast-ILU:Fast-ILUfactorizationforGPUs
▪ KokkosKernels:ColoringbasedGauss-Seidel(M.Deveci),TriangularSolves(A.Bradley)
ShyLU andSubdomainSolvers:Overview
TachoBasker FAST-ILUKLU2
Amesos2 Ifpack2
ShyLU
KokkosKernels –SGS, Tri-Solve (HTS)
23
ObjectConstructionModifications:MakingMPI+Xtrulyscalable(lotsofwork)
§ Pattern:1. Count /estimateallocationsize;mayuseKokkos parallel_scan2. Allocate;useKokkos::Viewforbestdatalayout&firsttouch3. Fill:parallel_reduce overerrorcodes;ifyourunoutofspace,keep
going,counthowmuchmoreyouneed,&returnto(2)4. Compute (e.g.,solvethelinearsystem)usingfilleddatastructures
§ ComparetoFill,Setup,Solve sparselinearalgebrausepattern§ Fortran<=77codersshouldfindthisfamiliar§ Semanticschange:Runningoutofmemorynotanerror!
§ Alwaysreturn:Eithernosideeffects,orcorrectresult§ Callersmustexpectfailure&protectagainstinfiniteloops§ Generalizestootherkindsoffailures,evenfaulttolerance
24
Trilinos Product Organization
ProductLeaders:Maximizecohesion,controlcoupling
26
§ Product:§ Framework(J.Willenbring).§ DataServices(K.Devine).§ LinearSolvers(S.Rajamanickam).§ NonlinearSolvers(R.Pawlowski).§ Discretizations (M.Perego).
§ Productfocus:§ New,strongerleadershipmodel.§ Focus:
§ PublishedAPIs§ Highcohesionwithinproduct.Lowcouplingacross.§ Deliberateproduct-levelupstreamplanning&design.
ForTrilinos: Full, sustainable Fortran access to Trilinos
HistoryViaadhoc interfaces,FortranappshavebenefittedfromTrilinoscapabilitiesformanyyears,mostnotablyclimate.
Why ForTrilinosTrilinosprovidealargecollectionofrobust,productionscientificC++software,includingstate-of-the-artmanycore/GPUcapabilities.ForTrilinos willgiveaccesstoFortranapps.
Research Details– Accessviaadhoc APIshasexistedforyears,especiallyintheclimatecommunity.
– ForTrilinos providesnative,sustainableAPIs,includingsupportforuser-providedFortranphysics-basedpreconditioninganduser-definedFortranoperators.
– Useofrobustauto-generationcodeandAPItoolsmakesfutureextensibilityfeasible.
– Auto-generationtoolsapplytootherprojects.– ForTrilinos willprovideearlyaccesstolatestscalablemanycore/GPUfunctionality,sophisticatedsolvers.
MichaelHeroux(PI,SNL),KateEvans(co-PI,ORNL),devteambasedatORNL.
https://github.com/flang-compiler
The Extreme-Scale Scientific Software Development Kit (xSDK)
Extreme-scaleScientificSoftwareEcosystem
Libraries• Solvers,etc.• Interoperable.
Frameworks&tools• Docgenerators.• Test,buildframework.
Extreme-scaleScientificSoftwareDevelopmentKit(xSDK)
SWengineering• Productivitytools.• Models,processes.
Domaincomponents• Reactingflow,etc.• Reusable.
Documentationcontent• Sourcemarkup.• Embeddedexamples.
Testingcontent• Unittests.• Testfixtures.
Buildcontent• Rules.• Parameters.
Libraryinterfaces• Parameterlists.• Interfaceadapters.• Functioncalls.
Shareddataobjects• Meshes.• Matrices,vectors.
Nativecode&dataobjects• Singleusecode.• Coordinatedcomponentuse.• Applicationspecific.
Extreme-scaleScienceApplications
Domaincomponentinterfaces• Datamediatorinteractions.• Hierarchicalorganization.• Multiscale/multiphysics coupling.
Focusofkeyaccomplishments:xSDK foundations
Impact:• Improvedcodequality,usability,access,sustainability• InformpotentialusersthatanxSDK memberpackagecanbeeasilyusedwithotherxSDK packages• Foundationforworkonperformanceportability,deeperlevelsofpackageinteroperability
Buildingthefoundationofahighlyeffectiveextreme-scalescientificsoftwareecosystem
30
Focus:Increasingthefunctionality,quality,andinteroperabilityofimportantscientificlibraries,domaincomponents,anddevelopmenttools¨ xSDKrelease0.2.0:April2017(soon)
¤ Spack packageinstallationn ./spack install xsdk
¤ Packageinteroperabilityn Numericallibraries
n hypre,PETSc,SuperLU,Trilinosn Domaincomponents
n Alquimia,PFLOTRAN¨ xSDK communitypolicies:
¤ Addresschallengesininteroperabilityandsustainabilityofsoftwaredevelopedbydiversegroupsatdifferentinstitutions
website:xSDK.info
xSDK communitypolicies31
xSDK compatiblepackage:MustsatisfymandatoryxSDK policies:M1.SupportxSDK communityGNUAutoconf orCMakeoptions.M2.Provideacomprehensivetestsuite.M3.Employuser-providedMPIcommunicator.M4.Givebesteffortatportabilitytokeyarchitectures.M5.Provideadocumented,reliablewaytocontactthedevelopmentteam.M6.Respectsystemresourcesandsettingsmadebyotherpreviouslycalledpackages.M7.Comewithanopensourcelicense.M8.ProvidearuntimeAPItoreturnthecurrentversionnumberofthesoftware.M9.Usealimitedandwell-definedsymbol,macro,library,andincludefilenamespace.M10.Provideanaccessiblerepository(notnecessarilypubliclyavailable).M11.HavenohardwiredprintorIOstatements.M12.Allowinstalling,building,andlinkingagainstanoutsidecopyofexternalsoftware.M13.Installheadersandlibrariesunder<prefix>/include/and<prefix>/lib/.M14.Bebuildableusing64bitpointers.32bitisoptional.
Draft 0.3, Dec 2016
Alsospecifyrecommendedpolicies,whichcurrentlyareencouragedbutnotrequired:R1.Haveapublicrepository.R2.Possibletoruntestsuiteundervalgrindinordertotestformemorycorruptionissues.R3.Adoptanddocumentconsistentsystemforerrorconditions/exceptions.R4.Freeallsystemresourcesithasacquiredassoonastheyarenolongerneeded.R5.Provideamechanismtoexportorderedlistoflibrarydependencies.
xSDK memberpackage:MustbeanxSDK-compatiblepackage,and itusesorcanbeusedbyanotherpackageinthexSDK,andtheconnectinginterfaceisregularlytestedforregressions.
https://xsdk.info/policies
xSDK4 SnapshotStatus
ASCRxSDKRelease0.1and0.2givesapplicationteamssingle-pointinstallation,accesstoTrilinos,hypre,PETScandSuperLU.
ValueThexSDKisessentialformulti-scale/physicsapplicationcouplingandinteroperability,andbroadestaccesstocriticallibraries.xSDkeffortsexpandcollaborationscopetoalllabs.
Research Details– xSDKstartedunderDOEASCR(Ndousse-Fetter,2014).– PriortoxSDK,verydifficulttousexSDKmemberlibstogether,adhocapproachrequired,versioninghard.
– Latestrelease(andfuture)usesSpack forbuilds.– xSDK4ECPwillinclude3additionallibs.– xSDKeffortsenablenewscopeofcross-labcoordinationandcollaboration.
– Long-termgoal:Createcommunityandpolicy-basedlibraryandcomponentecosystemsforcompositionalapplicationdevelopment.
Ref:xSDK Foundations:TowardanExtreme-scaleScientificSoftwareDevelopmentKit,,Feb2017,https://arxiv.org/abs/1702.08425,toappearinSupercomputingFrontiersandInnovations.
xsdk.info
hypre
Trilinos
PETSc
SuperLU
Released:Feb2017
TestedonkeymachinesatALCF,OLCF,NERSC,alsoLinuxandMacOSX
SLATE
DTK
SUNDIALS
Release:Oct2017
All ECP math and scientific libraries,
Many community
libraries
Future
MichaelHeroux(co-leadPI,SNL),LoisCurfman McInnes (co-leadPI,ANL),JamesWillenbring(Releaselead,SNL)
MorexSDK info33
¨ Paper:xSDK Foundations:TowardanExtreme-scaleScientificSoftwareDevelopmentKit¤ R.Bartlett,I.Demeshko,T.Gamblin,G.Hammond,M.Heroux,J.Johnson,A.Klinvex,X.Li,L.C.McInnes,D.Osei-Kuffuor,J.Sarich,B.Smith,J.Willenbring,U.M.Yang
¤ https://arxiv.org/abs/1702.08425¤ ToappearinSupercomputingFrontiersandInnovations,2017
¨ CSE17Posters:¤ xSDK:WorkingtowardaCommunitySoftwareEcosystem
n https://doi.org/10.6084/m9.figshare.4531526¤ ManagingtheSoftwareEcosystemwithSpack
n https://doi.org/10.6084/m9.figshare.4702294
FinalTake-AwayPoints▪ Intra-nodeparallelismisbiggestchallengerightnow:
▪ Kokkos providesvehicleforreasoningandimplementingon-nodeparallel.▪ Eventualgoal:SearchandreplaceKokkos::withstd::
▪ Node-parallelalgorithmsarealreadyavailable.▪ Fullynodeparallelexecutionishardwork.
▪ Inter-nodeparallelism:▪ Muelu frameworkprovideflexibility:
▪ Pluggable,customizablecomponents.▪ Multi-physics.
▪ TrilinosProducts:▪ Improvesupstreamplanningabilities.▪ Enablesincreasedcohesion,reduced(incidental)coupling.
▪ Communityexpansion:▪ ForTrilinos providesnativeaccessandextensibilityforFortranusers.▪ xSDK providesturnkey,consistentaccesstogrowingsetoflibraries.
▪ Contactifinterestedinjoining.▪ EuroTUG 2018:BetweenISCandPASC?
34