Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific...
Transcript of Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific...
![Page 1: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/1.jpg)
Developing GPU-Enabled Scientific Libraries
Matthew Knepley
Computation InstituteUniversity of Chicago
Department of Molecular Biology and PhysiologyRush University Medical Center
GPU–SMP 2011GPU Solutions to Multiscale Problems in Science and Engineering
Lanzhou, China July 18–21, 2011
M. Knepley (UC) GPU GPU-SMP 1 / 85
![Page 2: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/2.jpg)
Scientific Libraries
Outline
1 Scientific LibrariesWhat is PETSc?
2 Linear Systems
3 Assembly
4 Integration
5 Yet To be Done
M. Knepley (UC) GPU GPU-SMP 3 / 85
![Page 3: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/3.jpg)
Scientific Libraries
Main Point
To be widely accepted,
GPU computing must betransparent to the user,
and reuse existinginfrastructure.
M. Knepley (UC) GPU GPU-SMP 4 / 85
![Page 4: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/4.jpg)
Scientific Libraries
Main Point
To be widely accepted,
GPU computing must betransparent to the user,
and reuse existinginfrastructure.
M. Knepley (UC) GPU GPU-SMP 4 / 85
![Page 5: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/5.jpg)
Scientific Libraries
Main Point
To be widely accepted,
GPU computing must betransparent to the user,
and reuse existinginfrastructure.
M. Knepley (UC) GPU GPU-SMP 4 / 85
![Page 6: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/6.jpg)
Scientific Libraries
Lessons from Clusters and MPPs
FailureParallelizing CompilersAutomatic program decomposition
SuccessMPI (Library Approach)PETSc (Parallel Linear Algebra)User provides only the mathematical description
M. Knepley (UC) GPU GPU-SMP 5 / 85
![Page 7: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/7.jpg)
Scientific Libraries
Lessons from Clusters and MPPs
FailureParallelizing CompilersAutomatic program decomposition
SuccessMPI (Library Approach)PETSc (Parallel Linear Algebra)User provides only the mathematical description
M. Knepley (UC) GPU GPU-SMP 5 / 85
![Page 8: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/8.jpg)
Scientific Libraries What is PETSc?
Outline
1 Scientific LibrariesWhat is PETSc?
M. Knepley (UC) GPU GPU-SMP 6 / 85
![Page 9: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/9.jpg)
Scientific Libraries What is PETSc?
How did PETSc Originate?
PETSc was developed as a Platform forExperimentation
We want to experiment with differentModelsDiscretizationsSolversAlgorithms
which blur these boundaries
M. Knepley (UC) GPU GPU-SMP 7 / 85
![Page 10: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/10.jpg)
Scientific Libraries What is PETSc?
The Role of PETSc
Developing parallel, nontrivial PDE solvers thatdeliver high performance is still difficult and re-quires months (or even years) of concentratedeffort.
PETSc is a toolkit that can ease these difficul-ties and reduce the development time, but it isnot a black-box PDE solver, nor a silver bullet.— Barry Smith
M. Knepley (UC) GPU GPU-SMP 8 / 85
![Page 11: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/11.jpg)
Scientific Libraries What is PETSc?
Advice from Bill Gropp
You want to think about how you decompose your datastructures, how you think about them globally. [...] If youwere building a house, you’d start with a set of blueprintsthat give you a picture of what the whole house lookslike. You wouldn’t start with a bunch of tiles and say.“Well I’ll put this tile down on the ground, and then I’llfind a tile to go next to it.” But all too many people try tobuild their parallel programs by creating the smallestpossible tiles and then trying to have the structure oftheir code emerge from the chaos of all these littlepieces. You have to have an organizing principle ifyou’re going to survive making your code parallel.
(http://www.rce-cast.com/Podcast/rce-28-mpich2.html)
M. Knepley (UC) GPU GPU-SMP 9 / 85
![Page 12: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/12.jpg)
Scientific Libraries What is PETSc?
What is PETSc?
A freely available and supported researchcode for the parallel solution of nonlinearalgebraic equations
FreeDownload from http://www.mcs.anl.gov/petscFree for everyone, including industrial users
SupportedHyperlinked manual, examples, and manual pages for all routinesHundreds of tutorial-style examplesSupport via email: [email protected]
Usable from C, C++, Fortran 77/90, Matlab, Julia, and Python
M. Knepley (UC) GPU GPU-SMP 10 / 85
![Page 13: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/13.jpg)
Scientific Libraries What is PETSc?
What is PETSc?
Portable to any parallel system supporting MPI, including:Tightly coupled systems
Cray XT6, BG/Q, NVIDIA Fermi, K ComputerLoosely coupled systems, such as networks of workstations
IBM, Mac, iPad/iPhone, PCs running Linux or Windows
PETSc HistoryBegun September 1991Over 60,000 downloads since 1995 (version 2)Currently 400 per month
PETSc Funding and SupportDepartment of Energy
SciDAC, MICS Program, AMR Program, INL Reactor ProgramNational Science Foundation
CIG, CISE, Multidisciplinary Challenge Program
M. Knepley (UC) GPU GPU-SMP 11 / 85
![Page 14: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/14.jpg)
Scientific Libraries What is PETSc?
The PETSc Team
Bill Gropp Barry Smith Satish Balay
Jed Brown Matt Knepley Lisandro Dalcin
Hong Zhang Mark Adams Toby IssacM. Knepley (UC) GPU GPU-SMP 12 / 85
![Page 15: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/15.jpg)
Scientific Libraries What is PETSc?
Who Uses PETSc?
Computational Scientists
Earth SciencePyLith (CIG)Underworld (Monash)Magma Dynamics (LDEO, Columbia, Oxford)
Subsurface Flow and Porous MediaSTOMP (DOE)PFLOTRAN (DOE)
M. Knepley (UC) GPU GPU-SMP 13 / 85
![Page 16: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/16.jpg)
Scientific Libraries What is PETSc?
Who Uses PETSc?
Computational Scientists
CFDFiredrakeFluidityOpenFOAMfreeCFDOpenFVM
MicroMagneticsMagPar
FusionXGCBOUT++NIMROD
M. Knepley (UC) GPU GPU-SMP 14 / 85
![Page 17: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/17.jpg)
Scientific Libraries What is PETSc?
Who Uses PETSc?
Algorithm Developers
Iterative methodsDeflated GMRESLGMRESQCGSpecEst
Preconditioning researchersPrometheus (Adams)ParPre (Eijkhout)FETI-DP (Klawonn and Rheinbach)
M. Knepley (UC) GPU GPU-SMP 15 / 85
![Page 18: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/18.jpg)
Scientific Libraries What is PETSc?
Who Uses PETSc?
Algorithm Developers
Finite ElementslibMeshMOOSEPETSc-FEMDeal IIOOFEM
Other SolversFast Multipole Method (PetFMM)Radial Basis Function Interpolation (PetRBF)Eigensolvers (SLEPc)Optimization (TAO)
M. Knepley (UC) GPU GPU-SMP 16 / 85
![Page 19: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/19.jpg)
Scientific Libraries What is PETSc?
What Can We Handle?
PETSc has run implicit problems with over 500 billion unknownsUNIC on BG/P and XT5PFLOTRAN for flow in porous media
PETSc has run on over 290,000 cores efficientlyUNIC on the IBM BG/P Jugene at JülichPFLOTRAN on the Cray XT5 Jaguar at ORNL
PETSc applications have run at 23% of peak (600 Teraflops)Jed Brown on NERSC EdisonHPGMG code
M. Knepley (UC) GPU GPU-SMP 17 / 85
![Page 20: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/20.jpg)
Scientific Libraries What is PETSc?
What Can We Handle?
PETSc has run implicit problems with over 500 billion unknownsUNIC on BG/P and XT5PFLOTRAN for flow in porous media
PETSc has run on over 290,000 cores efficientlyUNIC on the IBM BG/P Jugene at JülichPFLOTRAN on the Cray XT5 Jaguar at ORNL
PETSc applications have run at 23% of peak (600 Teraflops)Jed Brown on NERSC EdisonHPGMG code
M. Knepley (UC) GPU GPU-SMP 17 / 85
![Page 21: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/21.jpg)
Scientific Libraries What is PETSc?
What Can We Handle?
PETSc has run implicit problems with over 500 billion unknownsUNIC on BG/P and XT5PFLOTRAN for flow in porous media
PETSc has run on over 290,000 cores efficientlyUNIC on the IBM BG/P Jugene at JülichPFLOTRAN on the Cray XT5 Jaguar at ORNL
PETSc applications have run at 23% of peak (600 Teraflops)Jed Brown on NERSC EdisonHPGMG code
M. Knepley (UC) GPU GPU-SMP 17 / 85
![Page 22: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/22.jpg)
Scientific Libraries What is PETSc?
Interface Questions
How should the user interact withmanycore systems?Through computational libraries
How should the user interact with the library?Strong, data structure-neutral API (Smith and Gropp, 1996)
How should the library interact withmanycore systems?
Existing library APIsCode generation (CUDA, OpenCL, PyCUDA)Custom multi-language extensions
M. Knepley (UC) GPU GPU-SMP 18 / 85
![Page 23: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/23.jpg)
Scientific Libraries What is PETSc?
Interface Questions
How should the user interact withmanycore systems?Through computational libraries
How should the user interact with the library?Strong, data structure-neutral API (Smith and Gropp, 1996)
How should the library interact withmanycore systems?
Existing library APIsCode generation (CUDA, OpenCL, PyCUDA)Custom multi-language extensions
M. Knepley (UC) GPU GPU-SMP 18 / 85
![Page 24: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/24.jpg)
Scientific Libraries What is PETSc?
Interface Questions
How should the user interact withmanycore systems?Through computational libraries
How should the user interact with the library?Strong, data structure-neutral API (Smith and Gropp, 1996)
How should the library interact withmanycore systems?
Existing library APIsCode generation (CUDA, OpenCL, PyCUDA)Custom multi-language extensions
M. Knepley (UC) GPU GPU-SMP 18 / 85
![Page 25: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/25.jpg)
Scientific Libraries What is PETSc?
Interface Questions
How should the user interact withmanycore systems?Through computational libraries
How should the user interact with the library?Strong, data structure-neutral API (Smith and Gropp, 1996)
How should the library interact withmanycore systems?
Existing library APIsCode generation (CUDA, OpenCL, PyCUDA)Custom multi-language extensions
M. Knepley (UC) GPU GPU-SMP 18 / 85
![Page 26: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/26.jpg)
Scientific Libraries What is PETSc?
Interface Questions
How should the user interact withmanycore systems?Through computational libraries
How should the user interact with the library?Strong, data structure-neutral API (Smith and Gropp, 1996)
How should the library interact withmanycore systems?
Existing library APIsCode generation (CUDA, OpenCL, PyCUDA)Custom multi-language extensions
M. Knepley (UC) GPU GPU-SMP 18 / 85
![Page 27: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/27.jpg)
Scientific Libraries What is PETSc?
Interface Questions
How should the user interact withmanycore systems?Through computational libraries
How should the user interact with the library?Strong, data structure-neutral API (Smith and Gropp, 1996)
How should the library interact withmanycore systems?
Existing library APIsCode generation (CUDA, OpenCL, PyCUDA)Custom multi-language extensions
M. Knepley (UC) GPU GPU-SMP 18 / 85
![Page 28: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/28.jpg)
Scientific Libraries What is PETSc?
Performance Analysis
In order to understand and predict the performance of GPU code, weneed:
good models for the computation, which make it possible to evaluatethe efficiency of an implementation;
a flop rate, which tells us how well we are utilizing the hardware;
timing, which is what users care about;
M. Knepley (UC) GPU GPU-SMP 19 / 85
![Page 29: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/29.jpg)
Linear Systems
Outline
1 Scientific Libraries
2 Linear Systems
3 Assembly
4 Integration
5 Yet To be Done
M. Knepley (UC) GPU GPU-SMP 20 / 85
![Page 30: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/30.jpg)
Linear Systems
Performance ExpectationsLinear Systems
The Sparse Matrix-Vector product (SpMV)is limited by system memory bandwidth,
rather than by peak flop rate.
We expect bandwidth ratio speedup (3x–6x for most systems)
Memory movement is more important than minimizing flops
Kernel is a vectorized, segmented sum (Blelloch, Heroux, andZagha: CMU-CS-93-173)
M. Knepley (UC) GPU GPU-SMP 21 / 85
![Page 31: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/31.jpg)
Linear Systems
Memory Bandwidth
All computations in this presentation are memory bandwidth limited.We have a bandwidth peak, the maximum flop rate achievable given abandwidth. This depends on β, the ratio of bytes transferred to flopsdone by the algorithm.
Processor BW (GB/s) Peak (GF/s) BW Peak∗ (GF/s)Core 2 Duo 4 34 1GeForce 9400M 21 54 5GTX 285 159 1062 40Tesla M2050 144 1030 36
∗Bandwidth peak is shown for β = 4
M. Knepley (UC) GPU GPU-SMP 22 / 85
![Page 32: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/32.jpg)
Linear Systems
STREAM Benchmark
Simple benchmark program measuring sustainable memory bandwidth
Protoypical operation is Triad (WAXPY): w = y + αxMeasures the memory bandwidth bottleneck (much below peak)Datasets outstrip cache
Machine Peak (MF/s) Triad (MB/s) MF/MW Eq. MF/sMatt’s Laptop 1700 1122.4 12.1 93.5 (5.5%)Intel Core2 Quad 38400 5312.0 57.8 442.7 (1.2%)Tesla 1060C 984000 102000.0* 77.2 8500.0 (0.8%)
Table: Bandwidth limited machine performance
http://www.cs.virginia.edu/stream/
M. Knepley (UC) GPU GPU-SMP 23 / 85
![Page 33: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/33.jpg)
Linear Systems
Analysis of Sparse Matvec (SpMV)
AssumptionsNo cache missesNo waits on memory references
Notationm Number of matrix rowsnz Number of nonzero matrix elementsV Number of vectors to multiply
We can look at bandwidth needed for peak performance(8 +
2V
)mnz
+6V
byte/flop (1)
or achieveable performance given a bandwith BWVnz
(8V + 2)m + 6nzBW Mflop/s (2)
Towards Realistic Performance Bounds for Implicit CFD Codes, Gropp,Kaushik, Keyes, and Smith.
M. Knepley (UC) GPU GPU-SMP 24 / 85
![Page 34: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/34.jpg)
Linear Systems
Linear Algebra Interfaces
Strong interfaces mean:
Easy code interoperability (LAPACK, Trilinos)
Easy portability (GPU)
Seamless optimization
M. Knepley (UC) GPU GPU-SMP 25 / 85
![Page 35: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/35.jpg)
Linear Systems
VECCUDA
Strategy: Define a new Vec implementation
Uses Thrust for data storage and operations on GPU
Supports full PETSc Vec interface
Inherits PETSc scalar type
Can be activated at runtime, -vec_type cuda
PETSc provides memory coherence mechanism
M. Knepley (UC) GPU GPU-SMP 26 / 85
![Page 36: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/36.jpg)
Linear Systems
MATAIJCUDA
Also define new Mat implementations
Uses Cusp for data storage and operations on GPU
Supports full PETSc Mat interface, some ops on CPU
Can be activated at runtime, -mat_type aijcuda
Notice that parallel matvec necessitates off-GPU data transfer
M. Knepley (UC) GPU GPU-SMP 27 / 85
![Page 37: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/37.jpg)
Linear Systems
Solvers
Solvers come for FreePreliminary Implementation of PETSc Using GPU,
Minden, Smith, Knepley, 2010
All linear algebra types work with solvers
Entire solve can take place on the GPUOnly communicate scalars back to CPU
GPU communication cost could be amortized over several solves
Preconditioners are a problemCusp has a promising AMG
M. Knepley (UC) GPU GPU-SMP 28 / 85
![Page 38: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/38.jpg)
Linear Systems
ExampleDriven Cavity Velocity-Vorticity with Multigrid
ex50 -da_vec_type seqcusp-da_mat_type aijcusp -mat_no_inode # Setup types-da_grid_x 100 -da_grid_y 100 # Set grid size-pc_type none -pc_mg_levels 1 # Setup solver-preload off -cuda_synchronize # Setup run-log_summary
M. Knepley (UC) GPU GPU-SMP 29 / 85
![Page 39: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/39.jpg)
Linear Systems
ExamplePFLOTRAN
Flow Solver32× 32× 32 grid
Routine Time (s) MFlops MFlops/sCPUKSPSolve 8.3167 4370 526MatMult 1.5031 769 512GPUKSPSolve 1.6382 4500 2745MatMult 0.3554 830 2337
P. Lichtner, G. Hammond,R. Mills, B. Phillip
M. Knepley (UC) GPU GPU-SMP 30 / 85
![Page 40: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/40.jpg)
Linear Systems
Serial PerformanceNVIDIA GeForce 9400M
M. Knepley (UC) GPU GPU-SMP 31 / 85
![Page 41: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/41.jpg)
Linear Systems
Serial PerformanceNVIDIA Tesla M2050
M. Knepley (UC) GPU GPU-SMP 32 / 85
![Page 42: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/42.jpg)
Linear Systems
Serial PerformanceNVIDIA Tesla M2050
M. Knepley (UC) GPU GPU-SMP 33 / 85
![Page 43: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/43.jpg)
Assembly
Outline
1 Scientific Libraries
2 Linear Systems
3 Assembly
4 Integration
5 Yet To be Done
M. Knepley (UC) GPU GPU-SMP 34 / 85
![Page 44: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/44.jpg)
Assembly
Performance ExpectationsMatrix Assembly
Matrix Assembly, aggregation of inputs,is also limited by memory bandwidth,
rather than by peak flop rate.
We expect bandwidth ratio speedup (3x–6x for most systems)
Input for FEM is a set of element matrices
Kernel is dominated by sort (submission to TOMS)
M. Knepley (UC) GPU GPU-SMP 35 / 85
![Page 45: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/45.jpg)
Assembly
Assembly Interface
A single new method is added:MatSetValuesBatch ( Mat J , Pe tsc In t Ne, Pe tsc In t Nl ,
Pe tsc In t *elemRows ,PetscScalar * elemMats )
Thus, a user just batches his input toachieve massive concurrency.
M. Knepley (UC) GPU GPU-SMP 36 / 85
![Page 46: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/46.jpg)
Assembly
Serial Assembly Steps
1 Copy elemRows and elemMat to device2 Allocate storage for intermediate COO matrix3 Use repeat&tile iterators to expand row input
M. Knepley (UC) GPU GPU-SMP 37 / 85
![Page 47: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/47.jpg)
Assembly
Serial Assembly Steps
1 Copy elemRows and elemMat to device2 Allocate storage for intermediate COO matrix3 Use repeat&tile iterators to expand row input
M. Knepley (UC) GPU GPU-SMP 37 / 85
![Page 48: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/48.jpg)
Assembly
Serial Assembly Steps
1 Copy elemRows and elemMat to device2 Allocate storage for intermediate COO matrix3 Use repeat&tile iterators to expand row input
M. Knepley (UC) GPU GPU-SMP 37 / 85
![Page 49: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/49.jpg)
Assembly
Convenience Iterators
repeated_range < I n d e xA r r a y I t e r a t o r >rowInd ( elemRows . begin ( ) , elemRows . end ( ) , Nl ) ;
t i l ed_range < I nd e x A r r ay I t e r a to r >co l I nd ( elemRows . begin ( ) , elemRows . end ( ) , Nl , Nl ) ;
Nl = 3elemRows 0 1 3rowInd 0 0 0 | 1 1 1 | 3 3 3colInd 0 1 3 | 0 1 3 | 0 1 3
M. Knepley (UC) GPU GPU-SMP 38 / 85
![Page 50: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/50.jpg)
Assembly
Serial Assembly Steps
1 Copy elemRows and elemMat to device2 Allocate storage for intermediate COO matrix3 Use repeat&tile iterators to expand row input4 Sort COO matrix by row and column
1 Get permutation from (stably) sorting columns2 Gather rows with this permutation3 Get permutation from (stably) sorting rows4 Gather columns with this permutation5 Gather values with this permutation
M. Knepley (UC) GPU GPU-SMP 39 / 85
![Page 51: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/51.jpg)
Assembly
Serial Assembly Steps
1 Copy elemRows and elemMat to device2 Allocate storage for intermediate COO matrix3 Use repeat&tile iterators to expand row input4 Sort COO matrix by row and column
1 Get permutation from (stably) sorting columns2 Gather rows with this permutation3 Get permutation from (stably) sorting rows4 Gather columns with this permutation5 Gather values with this permutation
M. Knepley (UC) GPU GPU-SMP 39 / 85
![Page 52: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/52.jpg)
Assembly
Multikey Sort
Initial input(1 0)(3 1)(0 0)(1 1)(3 3)(0 1)(0 3)(3 0)(1 3)
M. Knepley (UC) GPU GPU-SMP 40 / 85
![Page 53: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/53.jpg)
Assembly
Multikey Sort
Number pairs Index(1 0) 0(3 1) 1(0 0) 2(1 1) 3(3 3) 4(0 1) 5(0 3) 6(3 0) 7(1 3) 8
M. Knepley (UC) GPU GPU-SMP 41 / 85
![Page 54: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/54.jpg)
Assembly
Multikey Sort
After stable sort of columns Index(1 0) 0(0 0) 2(3 0) 7(3 1) 1(1 1) 3(0 1) 5(3 3) 4(0 3) 6(1 3) 8
M. Knepley (UC) GPU GPU-SMP 42 / 85
![Page 55: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/55.jpg)
Assembly
Multikey Sort
After gather of rowsusing column permutation,and implicit renumbering
Index(1 0) 0(0 0) 1(3 0) 2(3 1) 3(1 1) 4(0 1) 5(3 3) 6(0 3) 7(1 3) 8
M. Knepley (UC) GPU GPU-SMP 43 / 85
![Page 56: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/56.jpg)
Assembly
Multikey Sort
After stable sort of rows,and gather of columnsusing row permutation
Index(0 0) 1(0 1) 5(0 3) 7(1 0) 0(1 1) 4(1 3) 8(3 0) 2(3 1) 3(3 3) 6
M. Knepley (UC) GPU GPU-SMP 44 / 85
![Page 57: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/57.jpg)
Assembly
Serial Assembly Steps
1 Copy elemRows and elemMat to device2 Allocate storage for intermediate COO matrix3 Use repeat&tile iterators to expand row input4 Sort COO matrix by row and column5 Compute number of unique (i,j) entries using inner_product()
M. Knepley (UC) GPU GPU-SMP 45 / 85
![Page 58: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/58.jpg)
Assembly
Serial Assembly Steps
1 Copy elemRows and elemMat to device2 Allocate storage for intermediate COO matrix3 Use repeat&tile iterators to expand row input4 Sort COO matrix by row and column5 Compute number of unique (i,j) entries using inner_product()
M. Knepley (UC) GPU GPU-SMP 45 / 85
![Page 59: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/59.jpg)
Assembly
Counting Unique Entries
Initial input (0 0)(0 1)(0 1)(0 3)(1 0)(1 1)(3 0)(3 0)(3 0)
M. Knepley (UC) GPU GPU-SMP 46 / 85
![Page 60: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/60.jpg)
Assembly
Counting Unique Entries
Duplicate input (0 0) (0 0)(0 1) (0 1)(0 1) (0 1)(0 3) (0 3)(1 0) (1 0)(1 1) (1 1)(3 0) (3 0)(3 0) (3 0)(3 0) (3 0)
M. Knepley (UC) GPU GPU-SMP 47 / 85
![Page 61: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/61.jpg)
Assembly
Counting Unique Entries
Shift new sequenceand truncate initial input
(0 0) (0 1)(0 1) (0 1)(0 1) (0 3)(0 3) (1 0)(1 0) (1 1)(1 1) (3 0)(3 0) (3 0)(3 0) (3 0)
M. Knepley (UC) GPU GPU-SMP 48 / 85
![Page 62: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/62.jpg)
Assembly
Counting Unique Entries
“Multiply entries” usingnot-equals binary operator
(0 0) (0 1) =⇒ 1(0 1) (0 1) =⇒ 0(0 1) (0 3) =⇒ 1(0 3) (1 0) =⇒ 1(1 0) (1 1) =⇒ 1(1 1) (3 0) =⇒ 1(3 0) (3 0) =⇒ 0(3 0) (3 0) =⇒ 0
M. Knepley (UC) GPU GPU-SMP 49 / 85
![Page 63: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/63.jpg)
Assembly
Counting Unique Entries
Reduction of entries plus 1gives number of uniqueentries
1(0 0) (0 1) =⇒ 1(0 1) (0 1) =⇒ 0(0 1) (0 3) =⇒ 1(0 3) (1 0) =⇒ 1(1 0) (1 1) =⇒ 1(1 1) (3 0) =⇒ 1(3 0) (3 0) =⇒ 0(3 0) (3 0) =⇒ 0
6
M. Knepley (UC) GPU GPU-SMP 50 / 85
![Page 64: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/64.jpg)
Assembly
Serial Assembly Steps
1 Copy elemRows and elemMat to device2 Allocate storage for intermediate COO matrix3 Use repeat&tile iterators to expand row input4 Sort COO matrix by row and column5 Compute number of unique (i,j) entries using inner_product()
6 Allocate COO storage for final matrix7 Sum values with the same (i,j) index using reduce_by_key()
8 Convert to AIJ matrix9 Copy from GPU (if necessary)
M. Knepley (UC) GPU GPU-SMP 51 / 85
![Page 65: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/65.jpg)
Assembly
Serial Assembly Steps
1 Copy elemRows and elemMat to device2 Allocate storage for intermediate COO matrix3 Use repeat&tile iterators to expand row input4 Sort COO matrix by row and column5 Compute number of unique (i,j) entries using inner_product()
6 Allocate COO storage for final matrix7 Sum values with the same (i,j) index using reduce_by_key()
8 Convert to AIJ matrix9 Copy from GPU (if necessary)
M. Knepley (UC) GPU GPU-SMP 51 / 85
![Page 66: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/66.jpg)
Assembly
Serial Assembly Steps
1 Copy elemRows and elemMat to device2 Allocate storage for intermediate COO matrix3 Use repeat&tile iterators to expand row input4 Sort COO matrix by row and column5 Compute number of unique (i,j) entries using inner_product()
6 Allocate COO storage for final matrix7 Sum values with the same (i,j) index using reduce_by_key()
8 Convert to AIJ matrix9 Copy from GPU (if necessary)
M. Knepley (UC) GPU GPU-SMP 51 / 85
![Page 67: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/67.jpg)
Assembly
Serial Assembly Steps
1 Copy elemRows and elemMat to device2 Allocate storage for intermediate COO matrix3 Use repeat&tile iterators to expand row input4 Sort COO matrix by row and column5 Compute number of unique (i,j) entries using inner_product()
6 Allocate COO storage for final matrix7 Sum values with the same (i,j) index using reduce_by_key()
8 Convert to AIJ matrix9 Copy from GPU (if necessary)
M. Knepley (UC) GPU GPU-SMP 51 / 85
![Page 68: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/68.jpg)
Assembly
Parallel Assembly Steps
1 Copy elemRows and elemMat to device2 Use repeat&tile iterators to expand row input3 Communicate off-process entry sizes
1 Find number of off-process rows (serial)2 Map rows to processes (serial)3 Send number of rows to each process (collective)
M. Knepley (UC) GPU GPU-SMP 52 / 85
![Page 69: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/69.jpg)
Assembly
Parallel Assembly Steps
1 Copy elemRows and elemMat to device2 Use repeat&tile iterators to expand row input3 Communicate off-process entry sizes
1 Find number of off-process rows (serial)2 Map rows to processes (serial)3 Send number of rows to each process (collective)
M. Knepley (UC) GPU GPU-SMP 52 / 85
![Page 70: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/70.jpg)
Assembly
Parallel Assembly Steps
1 Copy elemRows and elemMat to device2 Use repeat&tile iterators to expand row input3 Communicate off-process entry sizes
1 Find number of off-process rows (serial)2 Map rows to processes (serial)3 Send number of rows to each process (collective)
M. Knepley (UC) GPU GPU-SMP 52 / 85
![Page 71: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/71.jpg)
Assembly
Parallel Assembly Steps
1 Copy elemRows and elemMat to device2 Use repeat&tile iterators to expand row input3 Communicate off-process entry sizes4 Allocate storage for intermediate diagonal COO matrix5 Partition entries
1 Partition into diagonal and off-diagonal&off-process usingpartition_copy ()
2 Partition again into off-diagonal and off-process usingstable_partition ()
M. Knepley (UC) GPU GPU-SMP 53 / 85
![Page 72: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/72.jpg)
Assembly
Parallel Assembly Steps
1 Copy elemRows and elemMat to device2 Use repeat&tile iterators to expand row input3 Communicate off-process entry sizes4 Allocate storage for intermediate diagonal COO matrix5 Partition entries
1 Partition into diagonal and off-diagonal&off-process usingpartition_copy ()
2 Partition again into off-diagonal and off-process usingstable_partition ()
M. Knepley (UC) GPU GPU-SMP 53 / 85
![Page 73: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/73.jpg)
Assembly
Parallel Assembly Steps
1 Copy elemRows and elemMat to device2 Use repeat&tile iterators to expand row input3 Communicate off-process entry sizes4 Allocate storage for intermediate diagonal COO matrix5 Partition entries
1 Partition into diagonal and off-diagonal&off-process usingpartition_copy ()
2 Partition again into off-diagonal and off-process usingstable_partition ()
M. Knepley (UC) GPU GPU-SMP 53 / 85
![Page 74: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/74.jpg)
Assembly
Partitioning EntriesProcess owns rows [0, 3)
Initial input
(0,0) · · · (0,2) (0,3)...
. . .... (0,3)
(2,0) · · · (2,2) (0,3)(3,0) (3,1) (3,2) (3,3)
(3 0)(0 1)(3 3)(0 3)(0 0)(3 1)(1 3)(1 1)(1 0)
M. Knepley (UC) GPU GPU-SMP 54 / 85
![Page 75: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/75.jpg)
Assembly
Partitioning EntriesProcess owns rows [0, 3)
Partition intodiagonal, andoff-diagonal &off-process entries
Diagonal
(0 0)(1 1)(0 1)(1 0)
Off-diagonalandOff-process
(3 1)(3 0)(1 3)(3 3)(0 3)
M. Knepley (UC) GPU GPU-SMP 55 / 85
![Page 76: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/76.jpg)
Assembly
Partitioning EntriesProcess owns rows [0, 3)
Partition again intooff-diagonal andoff-process entries
Diagonal
(0 0)(1 1)(0 1)(1 0)
Off-diagonal(1 3)(0 3)
Off-process(3 1)(3 0)(3 3)
M. Knepley (UC) GPU GPU-SMP 56 / 85
![Page 77: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/77.jpg)
Assembly
Parallel Assembly Steps
1 Copy elemRows and elemMat to device2 Use repeat&tile iterators to expand row input3 Communicate off-process entry sizes4 Allocate storage for intermediate diagonal COO matrix5 Partition entries6 Send off-process entries7 Allocate storage for intermediate off-diagonal COO matrix8 Repartition entries into diagonal and off-diagonal using
partition_copy ()
9 Repeat serial assembly on both matrices
M. Knepley (UC) GPU GPU-SMP 57 / 85
![Page 78: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/78.jpg)
Assembly
Parallel Assembly Steps
1 Copy elemRows and elemMat to device2 Use repeat&tile iterators to expand row input3 Communicate off-process entry sizes4 Allocate storage for intermediate diagonal COO matrix5 Partition entries6 Send off-process entries7 Allocate storage for intermediate off-diagonal COO matrix8 Repartition entries into diagonal and off-diagonal using
partition_copy ()
9 Repeat serial assembly on both matrices
M. Knepley (UC) GPU GPU-SMP 57 / 85
![Page 79: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/79.jpg)
Assembly
Parallel Assembly Steps
1 Copy elemRows and elemMat to device2 Use repeat&tile iterators to expand row input3 Communicate off-process entry sizes4 Allocate storage for intermediate diagonal COO matrix5 Partition entries6 Send off-process entries7 Allocate storage for intermediate off-diagonal COO matrix8 Repartition entries into diagonal and off-diagonal using
partition_copy ()
9 Repeat serial assembly on both matrices
M. Knepley (UC) GPU GPU-SMP 57 / 85
![Page 80: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/80.jpg)
Assembly
Parallel Assembly Steps
1 Copy elemRows and elemMat to device2 Use repeat&tile iterators to expand row input3 Communicate off-process entry sizes4 Allocate storage for intermediate diagonal COO matrix5 Partition entries6 Send off-process entries7 Allocate storage for intermediate off-diagonal COO matrix8 Repartition entries into diagonal and off-diagonal using
partition_copy ()
9 Repeat serial assembly on both matrices
M. Knepley (UC) GPU GPU-SMP 57 / 85
![Page 81: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/81.jpg)
Assembly
Parallel Assembly Steps
1 Copy elemRows and elemMat to device2 Use repeat&tile iterators to expand row input3 Communicate off-process entry sizes4 Allocate storage for intermediate diagonal COO matrix5 Partition entries6 Send off-process entries7 Allocate storage for intermediate off-diagonal COO matrix8 Repartition entries into diagonal and off-diagonal using
partition_copy ()
9 Repeat serial assembly on both matrices
M. Knepley (UC) GPU GPU-SMP 57 / 85
![Page 82: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/82.jpg)
Assembly
Serial PerformanceNVIDIA GTX 285
M. Knepley (UC) GPU GPU-SMP 58 / 85
![Page 83: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/83.jpg)
Integration
Outline
1 Scientific Libraries
2 Linear Systems
3 Assembly
4 IntegrationAnalytic FlexibilityComputational FlexibilityEfficiency
5 Yet To be Done
M. Knepley (UC) GPU GPU-SMP 59 / 85
![Page 84: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/84.jpg)
Integration
What are the Benefits for current PDE Code?
Low Order FEM on GPUs
Analytic Flexibility
Computational Flexibility
Efficiency
http://www.bitbucket.org/aterrel/flamefem
M. Knepley (UC) GPU GPU-SMP 60 / 85
![Page 85: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/85.jpg)
Integration
What are the Benefits for current PDE Code?
Low Order FEM on GPUs
Analytic Flexibility
Computational Flexibility
Efficiency
http://www.bitbucket.org/aterrel/flamefem
M. Knepley (UC) GPU GPU-SMP 60 / 85
![Page 86: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/86.jpg)
Integration
What are the Benefits for current PDE Code?
Low Order FEM on GPUs
Analytic Flexibility
Computational Flexibility
Efficiency
http://www.bitbucket.org/aterrel/flamefem
M. Knepley (UC) GPU GPU-SMP 60 / 85
![Page 87: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/87.jpg)
Integration
What are the Benefits for current PDE Code?
Low Order FEM on GPUs
Analytic Flexibility
Computational Flexibility
Efficiency
http://www.bitbucket.org/aterrel/flamefem
M. Knepley (UC) GPU GPU-SMP 60 / 85
![Page 88: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/88.jpg)
Integration Analytic Flexibility
Outline
4 IntegrationAnalytic FlexibilityComputational FlexibilityEfficiency
M. Knepley (UC) GPU GPU-SMP 61 / 85
![Page 89: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/89.jpg)
Integration Analytic Flexibility
Analytic FlexibilityLaplacian
∫T∇φi(x) · ∇φj(x)dx (3)
element = F in i teE lement ( ’ Lagrange ’ , te t rahedron , 1)v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )a = inner ( grad ( v ) , grad ( u ) ) * dx
M. Knepley (UC) GPU GPU-SMP 62 / 85
![Page 90: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/90.jpg)
Integration Analytic Flexibility
Analytic FlexibilityLaplacian
∫T∇φi(x) · ∇φj(x)dx (3)
element = F in i teE lement ( ’ Lagrange ’ , te t rahedron , 1)v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )a = inner ( grad ( v ) , grad ( u ) ) * dx
M. Knepley (UC) GPU GPU-SMP 62 / 85
![Page 91: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/91.jpg)
Integration Analytic Flexibility
Analytic FlexibilityLinear Elasticity
14
∫T
(∇~φi(x) +∇T ~φi(x)
):(∇~φj(x) +∇~φj(x)
)dx (4)
element = VectorElement ( ’ Lagrange ’ , te t rahedron , 1)v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )a = inner (sym( grad ( v ) ) , sym( grad ( u ) ) ) * dx
M. Knepley (UC) GPU GPU-SMP 63 / 85
![Page 92: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/92.jpg)
Integration Analytic Flexibility
Analytic FlexibilityLinear Elasticity
14
∫T
(∇~φi(x) +∇T ~φi(x)
):(∇~φj(x) +∇~φj(x)
)dx (4)
element = VectorElement ( ’ Lagrange ’ , te t rahedron , 1)v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )a = inner (sym( grad ( v ) ) , sym( grad ( u ) ) ) * dx
M. Knepley (UC) GPU GPU-SMP 63 / 85
![Page 93: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/93.jpg)
Integration Analytic Flexibility
Analytic FlexibilityFull Elasticity
14
∫T
(∇~φi(x) +∇T ~φi(x)
): C :
(∇~φj(x) +∇~φj(x)
)dx (5)
element = VectorElement ( ’ Lagrange ’ , te t rahedron , 1)cElement = TensorElement ( ’ Lagrange ’ , te t rahedron , 1 ,
( dim , dim , dim , dim ) )v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )C = C o e f f i c i e n t ( cElement )i , j , k , l = i nd i ces ( 4 )a = sym( grad ( v ) ) [ i , j ] *C[ i , j , k , l ] * sym( grad ( u ) ) [ k , l ] * dx
Currently broken in FEniCS release
M. Knepley (UC) GPU GPU-SMP 64 / 85
![Page 94: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/94.jpg)
Integration Analytic Flexibility
Analytic FlexibilityFull Elasticity
14
∫T
(∇~φi(x) +∇T ~φi(x)
): C :
(∇~φj(x) +∇~φj(x)
)dx (5)
element = VectorElement ( ’ Lagrange ’ , te t rahedron , 1)cElement = TensorElement ( ’ Lagrange ’ , te t rahedron , 1 ,
( dim , dim , dim , dim ) )v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )C = C o e f f i c i e n t ( cElement )i , j , k , l = i nd i ces ( 4 )a = sym( grad ( v ) ) [ i , j ] *C[ i , j , k , l ] * sym( grad ( u ) ) [ k , l ] * dx
Currently broken in FEniCS release
M. Knepley (UC) GPU GPU-SMP 64 / 85
![Page 95: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/95.jpg)
Integration Analytic Flexibility
Analytic FlexibilityFull Elasticity
14
∫T
(∇~φi(x) +∇T ~φi(x)
): C :
(∇~φj(x) +∇~φj(x)
)dx (5)
element = VectorElement ( ’ Lagrange ’ , te t rahedron , 1)cElement = TensorElement ( ’ Lagrange ’ , te t rahedron , 1 ,
( dim , dim , dim , dim ) )v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )C = C o e f f i c i e n t ( cElement )i , j , k , l = i nd i ces ( 4 )a = sym( grad ( v ) ) [ i , j ] *C[ i , j , k , l ] * sym( grad ( u ) ) [ k , l ] * dx
Currently broken in FEniCS release
M. Knepley (UC) GPU GPU-SMP 64 / 85
![Page 96: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/96.jpg)
Integration Analytic Flexibility
Form Decomposition
Element integrals are decomposed into analytic and geometric parts:
∫T ∇φi(x) · ∇φj(x)dx (6)
=∫T∂φi (x)∂xα
∂φj (x)∂xα dx (7)
=∫Tref
∂ξβ∂xα
∂φi (ξ)∂ξβ
∂ξγ∂xα
∂φj (ξ)∂ξγ|J|dx (8)
=∂ξβ∂xα
∂ξγ∂xα |J|
∫Tref
∂φi (ξ)∂ξβ
∂φj (ξ)∂ξγ
dx (9)
= Gβγ(T )K ijβγ (10)
Coefficients are also put into the geometric part.
M. Knepley (UC) GPU GPU-SMP 65 / 85
![Page 97: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/97.jpg)
Integration Analytic Flexibility
Weak Form Processing
from f f c . ana l ys i s impor t analyze_formsfrom f f c . compi ler impor t compute_ir
parameters = f f c . defau l t_parameters ( )parameters [ ’ r ep resen ta t i on ’ ] = ’ tensor ’ana l ys i s = analyze_forms ( [ a , L ] , { } , parameters )i r = compute_ir ( ana lys is , parameters )
a_K = i r [ 2 ] [ 0 ] [ ’AK ’ ] [ 0 ] [ 0 ]a_G = i r [ 2 ] [ 0 ] [ ’AK ’ ] [ 0 ] [ 1 ]
K = a_K . A0 . astype (numpy . f l o a t 3 2 )G = a_G
M. Knepley (UC) GPU GPU-SMP 66 / 85
![Page 98: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/98.jpg)
Integration Computational Flexibility
Outline
4 IntegrationAnalytic FlexibilityComputational FlexibilityEfficiency
M. Knepley (UC) GPU GPU-SMP 67 / 85
![Page 99: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/99.jpg)
Integration Computational Flexibility
Computational Flexibility
We generate different computations on the fly,
and can changeElement Batch Size
Number of Concurrent Elements
Loop unrolling
Interleaving stores with computation
M. Knepley (UC) GPU GPU-SMP 68 / 85
![Page 100: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/100.jpg)
Integration Computational Flexibility
Computational FlexibilityBasic Contraction
G K
Figure: Tensor Contraction Gβγ(T )K ijβγ
M. Knepley (UC) GPU GPU-SMP 69 / 85
![Page 101: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/101.jpg)
Integration Computational Flexibility
Computational FlexibilityBasic Contraction
G Kthread 0
Figure: Tensor Contraction Gβγ(T )K ijβγ
M. Knepley (UC) GPU GPU-SMP 69 / 85
![Page 102: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/102.jpg)
Integration Computational Flexibility
Computational FlexibilityBasic Contraction
G Kthread 0
thread 5
Figure: Tensor Contraction Gβγ(T )K ijβγ
M. Knepley (UC) GPU GPU-SMP 69 / 85
![Page 103: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/103.jpg)
Integration Computational Flexibility
Computational FlexibilityBasic Contraction
G Kthread 0
thread 5
thread 15
Figure: Tensor Contraction Gβγ(T )K ijβγ
M. Knepley (UC) GPU GPU-SMP 69 / 85
![Page 104: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/104.jpg)
Integration Computational Flexibility
Computational FlexibilityElement Batch Size
G0
G1
G2
G3
Kthread 0
thread 5
thread 15
Figure: Tensor Contraction Gβγ(T )K ijβγ
M. Knepley (UC) GPU GPU-SMP 70 / 85
![Page 105: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/105.jpg)
Integration Computational Flexibility
Computational FlexibilityElement Batch Size
G0
G1
G2
G3
K
thread 0
thread 5
thread 15
Figure: Tensor Contraction Gβγ(T )K ijβγ
M. Knepley (UC) GPU GPU-SMP 70 / 85
![Page 106: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/106.jpg)
Integration Computational Flexibility
Computational FlexibilityElement Batch Size
G0
G1
G2
G3
K
thre
ad0
thread 5
thread 15
Figure: Tensor Contraction Gβγ(T )K ijβγ
M. Knepley (UC) GPU GPU-SMP 70 / 85
![Page 107: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/107.jpg)
Integration Computational Flexibility
Computational FlexibilityElement Batch Size
G0
G1
G2
G3
K
thre
ad0
thread 5
thread 15
Figure: Tensor Contraction Gβγ(T )K ijβγ
M. Knepley (UC) GPU GPU-SMP 70 / 85
![Page 108: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/108.jpg)
Integration Computational Flexibility
Computational FlexibilityConcurrent Elements
G00
G01
G02
G03
G10
G11
G12
G13
Kthread 0
thread 5
thread 15
thread 16
thread 21
thre
ad31
Figure: Tensor Contraction Gβγ(T )K ijβγ
M. Knepley (UC) GPU GPU-SMP 71 / 85
![Page 109: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/109.jpg)
Integration Computational Flexibility
Computational FlexibilityConcurrent Elements
G00
G01
G02
G03
G10
G11
G12
G13
K
thread 0
thread 5
thread 15
thread 16thread 21
thre
ad31Figure: Tensor Contraction Gβγ(T )K ij
βγ
M. Knepley (UC) GPU GPU-SMP 71 / 85
![Page 110: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/110.jpg)
Integration Computational Flexibility
Computational FlexibilityConcurrent Elements
G00
G01
G02
G03
G10
G11
G12
G13
Kth
read
0
thread 5
thread 15
thread 16thread 21
thread 31
Figure: Tensor Contraction Gβγ(T )K ijβγ
M. Knepley (UC) GPU GPU-SMP 71 / 85
![Page 111: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/111.jpg)
Integration Computational Flexibility
Computational FlexibilityConcurrent Elements
G00
G01
G02
G03
G10
G11
G12
G13
Kth
read
0
thread 5
thread 15
thread 16thread 21
thread 31
Figure: Tensor Contraction Gβγ(T )K ijβγ
M. Knepley (UC) GPU GPU-SMP 71 / 85
![Page 112: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/112.jpg)
Integration Computational Flexibility
Computational FlexibilityLoop Unrolling
/ * G K c o n t r a c t i o n : u n r o l l = f u l l * /E [ 0 ] += G[ 0 ] * K [ 0 ] ;E [ 0 ] += G[ 1 ] * K [ 1 ] ;E [ 0 ] += G[ 2 ] * K [ 2 ] ;E [ 0 ] += G[ 3 ] * K [ 3 ] ;E [ 0 ] += G[ 4 ] * K [ 4 ] ;E [ 0 ] += G[ 5 ] * K [ 5 ] ;E [ 0 ] += G[ 6 ] * K [ 6 ] ;E [ 0 ] += G[ 7 ] * K [ 7 ] ;E [ 0 ] += G[ 8 ] * K [ 8 ] ;
M. Knepley (UC) GPU GPU-SMP 72 / 85
![Page 113: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/113.jpg)
Integration Computational Flexibility
Computational FlexibilityLoop Unrolling
/ * G K c o n t r a c t i o n : u n r o l l = none * /f o r ( i n t b = 0; b < 1; ++b ) {
const i n t n = b * 1 ;f o r ( i n t alpha = 0; alpha < 3; ++alpha ) {
f o r ( i n t beta = 0; beta < 3; ++beta ) {E [ b ] += G[ n*9+ alpha *3+ beta ] * K [ alpha *3+ beta ] ;
}}
}
M. Knepley (UC) GPU GPU-SMP 73 / 85
![Page 114: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/114.jpg)
Integration Computational Flexibility
Computational FlexibilityInterleaving stores
/ * G K c o n t r a c t i o n : u n r o l l = none * /f o r ( i n t b = 0; b < 4; ++b ) {
const i n t n = b * 1 ;f o r ( i n t alpha = 0; alpha < 3; ++alpha ) {
f o r ( i n t beta = 0; beta < 3; ++beta ) {E [ b ] += G[ n*9+ alpha *3+ beta ] * K [ alpha *3+ beta ] ;
}}
}/ * Store c o n t r a c t i o n r e s u l t s * /elemMat [ Eo f f se t + idx +0] = E [ 0 ] ;elemMat [ Eo f f se t + idx +16] = E [ 1 ] ;elemMat [ Eo f f se t + idx +32] = E [ 2 ] ;elemMat [ Eo f f se t + idx +48] = E [ 3 ] ;
M. Knepley (UC) GPU GPU-SMP 74 / 85
![Page 115: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/115.jpg)
Integration Computational Flexibility
Computational FlexibilityInterleaving stores
n = 0;f o r ( i n t alpha = 0; alpha < 3; ++alpha ) {
f o r ( i n t beta = 0; beta < 3; ++beta ) {E += G[ n*9+ alpha *3+ beta ] * K [ alpha *3+ beta ] ;
}}/ * Store c o n t r a c t i o n r e s u l t * /elemMat [ Eo f f se t + idx +0] = E;n = 1; E = 0 . 0 ; / * con t rac t * /elemMat [ Eo f f se t + idx +16] = E;n = 2; E = 0 . 0 ; / * con t rac t * /elemMat [ Eo f f se t + idx +32] = E;n = 3; E = 0 . 0 ; / * con t rac t * /elemMat [ Eo f f se t + idx +48] = E;
M. Knepley (UC) GPU GPU-SMP 75 / 85
![Page 116: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/116.jpg)
Integration Efficiency
Outline
4 IntegrationAnalytic FlexibilityComputational FlexibilityEfficiency
M. Knepley (UC) GPU GPU-SMP 76 / 85
![Page 117: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/117.jpg)
Integration Efficiency
PerformanceInfluence of Element Batch Sizes
M. Knepley (UC) GPU GPU-SMP 77 / 85
![Page 118: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/118.jpg)
Integration Efficiency
PerformanceInfluence of Element Batch Sizes
M. Knepley (UC) GPU GPU-SMP 78 / 85
![Page 119: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/119.jpg)
Integration Efficiency
PerformanceInfluence of Code Structure
M. Knepley (UC) GPU GPU-SMP 79 / 85
![Page 120: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/120.jpg)
Integration Efficiency
PerformanceInfluence of Code Structure
M. Knepley (UC) GPU GPU-SMP 80 / 85
![Page 121: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/121.jpg)
Integration Efficiency
Performance
Price-Performance Comparison of CPU and GPU3D P1 Laplacian Integration
Model Price ($) GF/s MF/s$GTX285 390 90 231Core 2 Duo 300 2 6.6
∗ Jed Brown Optimization Engine
M. Knepley (UC) GPU GPU-SMP 81 / 85
![Page 122: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/122.jpg)
Integration Efficiency
Performance
Price-Performance Comparison of CPU and GPU3D P1 Laplacian Integration
Model Price ($) GF/s MF/s$GTX285 390 90 231Core 2 Duo 300 12∗ 40
∗ Jed Brown Optimization Engine
M. Knepley (UC) GPU GPU-SMP 81 / 85
![Page 123: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/123.jpg)
Yet To be Done
Outline
1 Scientific Libraries
2 Linear Systems
3 Assembly
4 Integration
5 Yet To be Done
M. Knepley (UC) GPU GPU-SMP 82 / 85
![Page 124: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/124.jpg)
Yet To be Done
Competing Models
How should modern scientificcomputing be structured?
Current Model: PETSCSingle languageHand optimized3rd party librariesnew hardware
Alternative Model: PetCLAWMultiple language through PythonOptimization through code generation3rd party libaries through wrappersNew hardware through code generation
M. Knepley (UC) GPU GPU-SMP 83 / 85
![Page 125: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/125.jpg)
Yet To be Done
Competing Models
How should modern scientificcomputing be structured?
Current Model: PETSCSingle languageHand optimized3rd party librariesnew hardware
Alternative Model: PetCLAWMultiple language through PythonOptimization through code generation3rd party libaries through wrappersNew hardware through code generation
M. Knepley (UC) GPU GPU-SMP 83 / 85
![Page 126: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/126.jpg)
Yet To be Done
Competing Models
How should modern scientificcomputing be structured?
Current Model: PETSCSingle languageHand optimized3rd party librariesnew hardware
Alternative Model: PetCLAWMultiple language through PythonOptimization through code generation3rd party libaries through wrappersNew hardware through code generation
M. Knepley (UC) GPU GPU-SMP 83 / 85
![Page 127: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/127.jpg)
Yet To be Done
Competing Models
How should modern scientificcomputing be structured?
Current Model: PETSCSingle languageHand optimized3rd party librariesnew hardware
Alternative Model: PetCLAWMultiple language through PythonOptimization through code generation3rd party libaries through wrappersNew hardware through code generation
M. Knepley (UC) GPU GPU-SMP 83 / 85
![Page 128: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/128.jpg)
Yet To be Done
New Model for Scientific Software
Application
FFC/SyFieqn. definitionsympy symbolics
numpyda
tast
ruct
ures
petsc4py
solve
rs
PyCUDA
integration/assembly
PETScCUDA
OpenCL
Figure: Schematic for a generic scientific applicationM. Knepley (UC) GPU GPU-SMP 84 / 85
![Page 129: Developing GPU-Enabled Scientific Librariesmk51/presentations/PresLanzhou2011.pdfScientific Libraries Main Point To be widely accepted, GPU computing must be transparent to the user,](https://reader033.fdocuments.in/reader033/viewer/2022051810/60161655c9b9cd632b0392b5/html5/thumbnails/129.jpg)
Yet To be Done
What Do We Still Need?
Better integration of code generationMatch CUDA driver interface to CUDA runtime interface
Extend code generation to quadrature schemes
Kernel fusion in assembly
Better hierarchical parallelismLarger scale parallel GPU tests
Synchronization reduction in current algorithms
M. Knepley (UC) GPU GPU-SMP 85 / 85