Nvidia GTC 2014 Talk

Establishing a CUDA Research Center at Penn State: Perspectives on GPU-Enabled Teaching and Research

William J. Brouwer ([email protected]), Pierre-Yves Taunay ([email protected])

Research Computing and Cyberinfrastructure, The Pennsylvania State University

Nvidia GTC 2014, 04/01/14

description

Penn State RCC has been a CUDA research center for the last year; this talk provides success stories and challenges. GPU case studies are given, including algorithm details and performance results.

Transcript of Nvidia GTC 2014 Talk

Page 2: Nvidia GTC 2014 Talk

Outline

● Center Overview (RCC @ PSU)

● GPU accelerated research

  ● IceCube

  ● Metabolic Networks (Fsolve/cuSolve)

  ● MD + Simulated Annealing

  ● FQHE (LU Decomposition)

  ● Smart Proppants (QR Decomposition)

● GPU cluster scaling

  ● Amber

  ● PetaChem

  ● Quantum Espresso

    – Lanczos Diagonalization

● CUDA, needs + wants

● Summary

Page 3: Nvidia GTC 2014 Talk

Center Overview

● Research Computing and Cyberinfrastructure (RCC) at PSU provides high performance computing services:

  ● Hardware, proprietary/open source software

  ● Consultation (numerical/algorithmic, software development, etc.)

● PhDs, system admins and programmers work together to provide these services to academics while performing independent research

● Many users are interested in using GPUs for science and engineering research applications; we are a CUDA research center: https://research.nvidia.com/content/penn-state-crc-summary

● Formerly under ITS, currently being incorporated into the Office of the Vice President for Research (OVPR)

Page 4: Nvidia GTC 2014 Talk

Center Overview

● Hardware is ~12K CPU cores, 64 GPUs (Fermi), several Kepler

● Red Hat Linux, scheduling via PBS/Moab/Torque

● Usual monitoring/management tools, e.g. Puppet, Jenkins, Nagios, Ganglia, and some custom solution(s) (e.g. CLPR)

● Serve ~7K users, all campuses in the commonwealth

● Use CUDA predominantly, although growing numbers of users are trying OpenACC, OpenCL, libraries, etc.

● Environment modules system

Page 5: Nvidia GTC 2014 Talk

Center Overview

● Support many GPU accelerated applications

Page 7: Nvidia GTC 2014 Talk

IceCube

Page 8: Nvidia GTC 2014 Talk

Metabolic Networks

● Optimal models for the metabolic networks of microbial organisms important in pharma, energy industries

● Ensemble Modeling (EM) is used to construct chemical kinetics of microbial organisms → decompose metabolic reactions into the elementary mechanisms, which are ODE systems f(k_i, y_j) = dy_j/dt

● Overall approach maximizes correlation between model predictions and experimental measurements, performed in steady state → solve f(k,y) = 0

Page 9: Nvidia GTC 2014 Talk

Metabolic Networks

● [CPU] parse equations f(k,y)

● [CPU] differentiate f(k,y), create analytic J(k,y)

● [CPU] populate data structures representing f(k,y), J(k,y), copy to GPU

● [GPU] Iterate (Newton-Raphson) →

  ● Numerically evaluate f(k,y) and J(k,y) by parallel reduction

  ● Solve for delta in J(k,y) . delta = -f(k,y) using GMRES

  ● Update y += delta and repeat until ||f(k,y)|| < tol
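The Newton-Raphson loop on this slide can be sketched on the CPU in a few lines of Python. This is only an illustration under stated assumptions, not the GPU solver from the talk: the two-species system `f`, its Jacobian `jac`, the rate constants `k`, and the dense Gaussian-elimination helper `solve` (standing in for GMRES) are all hypothetical.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Stand-in for GMRES; adequate only for tiny dense systems."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def f(k, y):
    # Hypothetical kinetics: constant inflow -> y0 -> y1 -> outflow
    return [k[0] - k[1] * y[0] * y[0],
            k[1] * y[0] * y[0] - k[2] * y[1]]

def jac(k, y):
    # Analytic Jacobian df_i/dy_j of the hypothetical system above
    return [[-2.0 * k[1] * y[0], 0.0],
            [2.0 * k[1] * y[0], -k[2]]]

def newton(k, y, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        fy = f(k, y)
        if math.sqrt(sum(v * v for v in fy)) < tol:
            break
        delta = solve(jac(k, y), [-v for v in fy])  # J . delta = -f
        y = [yi + di for yi, di in zip(y, delta)]   # y += delta
    return y

k = [4.0, 1.0, 2.0]
y_steady = newton(k, [1.0, 1.0])
```

For these assumed rate constants the steady state is y = (2, 2), which follows directly from setting both right-hand sides to zero.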

Page 10: Nvidia GTC 2014 Talk

Metabolic Networks

● Solution uses various libraries including Boost, Thrust, CUSP and CUDA

● Matrices sparse, poorly conditioned, but solution works well for O(10^2) equations

● Currently working to scale to larger, more interesting networks and microbial organisms

● CuSolve is a work in progress, a GPU-only ODE solver for stiff equations

Page 11: Nvidia GTC 2014 Talk

Molecular Dynamics + Sim Anneal

● Solve for MD potentials by fitting experimental data for structure factor

● Optimization surface (below) is highly non-convex → use simulated annealing, each GPU performs independent MD run
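A minimal sketch of the annealing strategy, assuming a toy one-dimensional objective in place of the real structure-factor misfit. The function `misfit`, the cooling schedule, the step size, and the use of one random seed per chain (mimicking one independent run per GPU) are all assumptions for illustration.

```python
import math
import random

def misfit(x):
    # Hypothetical non-convex objective standing in for the structure-factor
    # misfit; many local minima, global minimum 0 at x = 0.
    return x * x + 10.0 * (1.0 - math.cos(2.0 * math.pi * x))

def anneal(seed, steps=20000, t_start=10.0, t_end=1e-3):
    rng = random.Random(seed)
    x = rng.uniform(-5.0, 5.0)
    e = misfit(x)
    best_x, best_e = x, e
    for i in range(steps):
        # Geometric cooling from t_start down to t_end
        t = t_start * (t_end / t_start) ** (i / (steps - 1))
        cand = x + rng.gauss(0.0, 0.5)
        e_cand = misfit(cand)
        # Metropolis criterion: always accept downhill, sometimes uphill
        if e_cand <= e or rng.random() < math.exp((e - e_cand) / t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

# Independent chains, one per (hypothetical) device; keep the best result
results = [anneal(seed) for seed in range(8)]
best_x, best_e = min(results, key=lambda r: r[1])
```

Because the chains never communicate, the outer loop maps directly onto one process or device per chain.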

Page 12: Nvidia GTC 2014 Talk

LU Decomposition

● Batch LU decomposition developed for fractional quantum Hall effect, fundamental physics that has implications in quantum computation and material science

● O(N!) determinants need to be evaluated in constructing wavefunction, process repeated many times in Monte Carlo calculation

● Small, dense matrices of side <= 512

● Implementation exploits SIMD architecture, parallel reduction

● Example: N=11, 1024 Monte Carlo iterations; computation time is ~246 seconds using 8 GPU devices (w/ MPI), down from ~31488 seconds on a single CPU (~128x)
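To make the determinant step concrete, here is a small sketch of a determinant computed via LU decomposition with partial pivoting, applied to a batch of small dense matrices. It is a serial CPU illustration, not the batched GPU kernel described in the talk; the function name `lu_determinant` and the sample matrices are assumptions.

```python
def lu_determinant(a):
    """Determinant of a square matrix via LU elimination with partial pivoting."""
    n = len(a)
    m = [row[:] for row in a]  # work on a copy
    det = 1.0
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[piv][col] == 0.0:
            return 0.0  # singular matrix
        if piv != col:
            m[col], m[piv] = m[piv], m[col]
            det = -det  # each row swap flips the sign
        det *= m[col][col]  # product of the pivots (diagonal of U)
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col + 1, n):
                m[r][c] -= factor * m[col][c]
    return det

# A batch of small dense matrices; each determinant is independent
batch = [
    [[2.0, 0.0], [0.0, 3.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[4.0, 3.0, 0.0], [6.0, 3.0, 1.0], [0.0, 2.0, 5.0]],
]
dets = [lu_determinant(a) for a in batch]
```

Since every matrix in the batch is processed independently, the list comprehension is the part that a batched implementation would parallelize.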

Page 13: Nvidia GTC 2014 Talk

LU Decomposition

Page 14: Nvidia GTC 2014 Talk

QR Decomposition

● Proppant materials used to stabilize fissures created during hydraulic fracturing

● 'Smart proppants' are essentially electrical dipoles which may absorb and re-emit EM energy, irradiated and recorded by downhole instrumentation

● This work considers an iteration-free solution to this EM scattering problem, using linear algebra including LU decomposition and SVD

● SVD can be performed using the QR algorithm, which in turn relies on QR decomposition

● Devised a unique approach for large batches of dense small matrices using Givens rotations; largely independent ops, maps well to GPU
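As an illustration of the building block, the sketch below factors one small dense matrix with Givens rotations. It shows the textbook rotation scheme only, not the batched GPU approach devised in the talk; the function name `givens_qr` and the sample matrix are assumptions.

```python
import math

def givens_qr(a):
    """QR factorization of a square matrix using Givens rotations.
    Returns (q, r) with q orthogonal, r upper triangular, and q r = a."""
    n = len(a)
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for col in range(n - 1):
        for row in range(n - 1, col, -1):
            x, y = r[row - 1][col], r[row][col]
            h = math.hypot(x, y)
            if h == 0.0:
                continue  # entry is already zero
            c, s = x / h, y / h
            # Rotate rows (row-1, row) of r so that r[row][col] becomes zero
            for j in range(n):
                u, v = r[row - 1][j], r[row][j]
                r[row - 1][j] = c * u + s * v
                r[row][j] = -s * u + c * v
            # Accumulate the transposed rotation into the columns of q
            for i in range(n):
                u, v = q[i][row - 1], q[i][row]
                q[i][row - 1] = c * u + s * v
                q[i][row] = -s * u + c * v
    return q, r

a = [[6.0, 5.0, 0.0], [5.0, 1.0, 4.0], [0.0, 4.0, 3.0]]
q, r = givens_qr(a)
```

Each rotation touches only two rows, which is why the operations are largely independent across a large batch of matrices.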

Page 15: Nvidia GTC 2014 Talk

QR Decomposition

Page 17: Nvidia GTC 2014 Talk

GPU Cluster Scaling

● Several key GPU accelerated software suites were tested using multiple GPUs across two clusters

Cluster                    Lion-GA                           Stampede
CPU                        12 X5675 @ 3.07 GHz               16 E5-2680 @ 2.70 GHz
GPU                        8 M2070 or 8 M2090                1 K20c
Nodes equipped with GPUs   8                                 120
Interconnect               40 Gb/s Mellanox QDR Infiniband   56 Gb/s Mellanox FDR Infiniband

Page 18: Nvidia GTC 2014 Talk

GPU Cluster Scaling

● Lion-GA cluster has 3 GPUs per PCIe switch, 3 to 5 GPUs per IOH chip

● IOH doesn't support peer to peer transfers between GPU devices on different chipsets

● Difficult to achieve peak transfer rates between GPUs on different sockets

Page 19: Nvidia GTC 2014 Talk

Amber

● Molecular Dynamics is widely used for simulation of solvated proteins or molecules and makes use of various force fields (AMBER, ReaxFF, etc.)

● AMBER force field is implemented in the eponymous software suite

● The software PMEMD in AMBER is used for both explicit solvent Particle Mesh Ewald (PME) and implicit solvent Generalized Born (GB) simulations

● AMBER does not require extensive communication between GPUs or between CPU and GPU, and does not take advantage of the CPU if GPUs are used

● GPU acceleration allows for longer simulation times ~ nanosecond or more

Page 20: Nvidia GTC 2014 Talk

Amber

[Figure: PME simulation of DHFR protein in water (NPT ensemble, 23,558 atoms). Achieved performance on Lion-GA in ns/day for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090.]

Page 21: Nvidia GTC 2014 Talk

Amber

[Figure: PME simulation of FactorIX molecule in water (NPT ensemble, 90,906 atoms). Achieved performance on Lion-GA in ns/day for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090.]

Page 22: Nvidia GTC 2014 Talk

Amber

[Figure: PME simulation of Cellulose molecule in water (NPT ensemble, 408,609 atoms). Achieved performance on Lion-GA in ns/day for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090.]

Page 23: Nvidia GTC 2014 Talk

Amber

[Figure: Implicit solvent GB simulation of Myoglobin (2,492 atoms). Achieved performance on Lion-GA in ns/day for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090.]

Page 24: Nvidia GTC 2014 Talk

Amber

[Figure: Implicit solvent GB simulation of Nucleosome (25,095 atoms). Achieved performance on Lion-GA in ns/day for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090.]

Page 25: Nvidia GTC 2014 Talk

PetaChem

● Quantum chemistry software designed to run on NVIDIA hardware

● Features restricted Hartree-Fock and grid-based Kohn-Sham single point energy and gradient calculations

● Various functions supported: geometry optimization, ab-initio molecular dynamics, multi-GPU support

● Benchmark: single point energy, using basis 6-31g for Olestra

Page 26: Nvidia GTC 2014 Talk

PetaChem

[Figure: PetaChem Olestra SCF calculation. Total walltime (in s) on Lion-GA for 1 M2070, 3 M2070, 5 M2070, 7 M2070.]

Page 27: Nvidia GTC 2014 Talk

Quantum Espresso

● Density Functional Theory (DFT) has enjoyed huge growth in popularity owing to computational and numerical advancements; used widely in material science

● Quantum Espresso (QE) is an open source DFT package that has recently added GPU acceleration, largely through BLAS and FFT routines

● When building QE with MAGMA (UT/ORNL) or phiGEMM, one introduces heterogeneous CPU/GPU linear algebra routines

● Benchmark:

  ● Self-consistent field calculation, using PBE pseudopotentials, 168 atoms (cellulose)

  ● Periodic boundary conditions, kinetic energy cutoff for charge density of 80 Ry, Davidson diagonalization

Page 28: Nvidia GTC 2014 Talk

Quantum Espresso

[Figure: SCF calculation for cellulose. Total walltime (in hrs) on Stampede@TACC for 1 K20, 2 K20, 4 K20, 8 K20, 16 K20, 32 K20.]

Page 29: Nvidia GTC 2014 Talk

Lanczos Diagonalization

● A key task in many applications, esp. quantum chemistry & DFT, is diagonalization, i.e. matrix eigen-decomposition

● Lanczos is a Krylov subspace method related to the power method; it produces a tridiagonal matrix, which is more readily solvable, and consists of many matrix-vector operations, very amenable to GPU. Currently using cuBLAS & MKL in a heterogeneous solution.

● Originally devised for fundamental physics project at PSU, now intended for incorporation into GPU-Quantum Espresso project being led by Filippo Spiga

● Attempting to scale to multiple devices using MPI + GPUdirect, still beset by some numerical/convergence problems with increasing matrix size
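The three-term Lanczos recurrence can be sketched as follows. This is a plain serial version for a small symmetric matrix, assuming dense list-of-lists storage; the helper names `matvec` and `lanczos` and the sample matrix are hypothetical, and the sketch omits the reorthogonalization that plain Lanczos generally needs in finite precision for larger problems.

```python
import math

def matvec(a, v):
    # Dense matrix-vector product; the operation that dominates Lanczos
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def lanczos(a, v0, m):
    """Run up to m Lanczos steps on a symmetric matrix a.
    Returns (alphas, betas): the diagonal and off-diagonal of the
    tridiagonal matrix T."""
    norm = math.sqrt(sum(x * x for x in v0))
    v = [x / norm for x in v0]
    v_prev = [0.0] * len(v0)
    beta = 0.0
    alphas, betas = [], []
    for step in range(m):
        w = matvec(a, v)
        w = [wi - beta * pi for wi, pi in zip(w, v_prev)]
        alpha = sum(wi * vi for wi, vi in zip(w, v))
        w = [wi - alpha * vi for wi, vi in zip(w, v)]
        alphas.append(alpha)
        beta = math.sqrt(sum(wi * wi for wi in w))
        if step == m - 1 or beta < 1e-12:
            break  # finished, or the Krylov subspace is exhausted
        betas.append(beta)
        v_prev, v = v, [wi / beta for wi in w]
    return alphas, betas

a = [[4.0, 1.0, 2.0],
     [1.0, 3.0, 0.0],
     [2.0, 0.0, 1.0]]
alphas, betas = lanczos(a, [1.0, 0.0, 0.0], 3)
```

When the recurrence runs for the full dimension without breakdown, the sum of the `alphas` equals the trace of the input matrix, which gives a quick sanity check.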

Page 30: Nvidia GTC 2014 Talk

Lanczos Diagonalization

Page 31: Nvidia GTC 2014 Talk

Lanczos Diagonalization

● CUDA 5.5/Kepler overall yields pleasing communication results (CUDA-enabled openmpi 1.7.3, MPI send/recv), collectives less impressive

● Bandwidths for one-sided comms have some message size dependency & jitter, but effective bandwidth much improved over previous gens.

[Figure: Bandwidth (GB/s) vs. increasing msg size in MB, within single application]

● Results of 4 tests

● RHEL 6, Intel x86_64, Nvidia driver 331.38

● Communication between K20 & K40

Page 33: Nvidia GTC 2014 Talk

CUDA needs + wants

● ODE and Function Solver(s), metabolic networks, chemically reactive flows w/ OpenFOAM → support for more C++11 language features?

● Lanczos Diagonalization, DFT/quantum chemistry, incorporation into Quantum Espresso → further improvements to GPUdirect (or use new multi-GPU interfaces instead)?

● Batch LU/QR → increased warp size?

Page 34: Nvidia GTC 2014 Talk

Summary

● Early adopters (astrophysics, quantum chem/condensed matter) still active; most growth seen in strands of computational biology/life science and 'big data'

● Teaching seminars generally well received/attended, but...

● Most success from working to identify users/codes that can benefit from GPU by monitoring clusters, and on a related note...

● The harvest is plentiful in academia but the workers are few; generally, if a code 'works' there is little pressure to make it better

● However, changes even in traditional CPU architecture are forcing workers to reevaluate their computational models (thanks Ken Esler for this perspective); we live more and more in a parallel world

Page 35: Nvidia GTC 2014 Talk

Acknowledgements

● Mark Berger, Chandra Cheij & Nvidia for generous donations

● {Ryan Eagen/Cowen group, Ali Khodayari/Maranas group, Sreejith Jaya Ganesh, Jim Kubicki, Dan Haworth, Adri Van Duin} PSU

● {Chuck Gilbert, Jason Holmes} long-suffering sys admins

● HP for donation of 50 M2070

● XSEDE/TACC for Stampede cycles