Programming models for quantum many-body methods on multicore and manycore processors

Jeff Hammond (1) and Eugene DePrince (2)
(1) Argonne, (2) Georgia Tech
6 February 2011
Abstract
The growing diversity in computer processor architectures poses a serious challenge to the computational chemistry community. This talk considers some of the key issues, including disjoint address spaces, non-standard architectures and execution models, and the different APIs required to use them. Specifically, we will describe our experiences in developing coupled-cluster methods for Intel multicore, Intel MIC, NVIDIA Fermi and Blue Gene/Q in both a clean-sheet implementation and NWChem. Of fundamental interest is the development of codes that scale not only within the node but across thousands of nodes; hence, the interaction between the processor and the network will be analyzed in detail.
Software Automation
The practical TCE – NWChem many-body codes
What does it do?
1. GUI input: quantum many-body theory, e.g. CCSD.
2. Operator specification of theory.
3. Apply Wick's theorem to transform operator expressions into array expressions.
4. Transform the input array expression to an operation tree using many types of optimization.
5. Produce a Fortran + Global Arrays + NXTVAL implementation.

The developer can intercept at various stages to modify the theory, algorithm, or implementation.
The practical TCE – Success stories
First parallel implementation of many (most) CC methods.
First truly generic CC code (not string-based):
{RHF,ROHF,UHF} × CC{SD,SDT,SDTQ} × {T/Λ, EOM, LR/QR}
Most of the largest calculations of their kind employ TCE:
CR-EOMCCSD(T), CCSD-LR α, CCSD-QR β, CCSDT-LR α
Reduces implementation time for new methods from years to hours; TCE codes are easy to verify.
Significant hand-tuning by Karol Kowalski and others at PNNL was required to make TCE run efficiently and scale to 1000 processors and beyond.
Before TCE
CCSD/aug-cc-pVDZ – 192 b.f. – days on 1 processor
Benzene is close to crossover point between small and large.
Linear response polarizability
CCSD/Z3POL – 1080 b.f. – 40 hours on 1024 processors
Computationally, this problem is 20,000 times larger than the benzene calculation.
J. Chem. Phys. 129, 226101 (2008).
Quadratic response hyperpolarizability
CCSD/d-aug-cc-pVTZ – 812 b.f. – 20 hours on 1024 processors
Lower levels of theory are not reliable for this system.
J. Chem. Phys. 130, 194108 (2009).
Charge-transfer excited-states of biomolecules
CR-EOMCCSD(T)/6-31G* – 584 b.f. – 1 hour on 256 cores
Lower levels of theory are wildly incorrect for this system.
Excited-state calculation of conjugated arrays
CR-EOMCCSD(T)/6-31+G* – 1096 b.f. – 15 hours on 1024 cores
J. Chem. Phys. 132, 154103 (2010).
Summary of TCE module
http://cloc.sourceforge.net v 1.53 T=30.0 s
---------------------------------------------
Language files blank comment code
---------------------------------------------
Fortran 77 11451 1004 115129 2824724
---------------------------------------------
SUM: 11451 1004 115129 2824724
---------------------------------------------
Only <25 KLOC are hand-written; ∼100 KLOC is utility code following the TCE data-parallel template.
My thesis work
http://cloc.sourceforge.net v 1.53 T=13.0 s
---------------------------------------------
Language files blank comment code
---------------------------------------------
Fortran 77 5757 0 29098 983284
---------------------------------------------
SUM: 5757 0 29098 983284
---------------------------------------------
Total does not include ∼1M LOC that was reused (EOM).
CCSD quadratic response hyperpolarizability was derived, implemented, and verified during a two-week trip to PNNL. Over 100 KLOC were “written” in under an hour.
Before Automation
(The Benz comes before the Ford)
What code are we going to generate?
Legacy design limits options
NVIDIA hardware was changing fast
CUDA Fortran didn’t exist in 2009
Eugene had a loop-based CEPA code in C...
Assume that data motion across PCI bus was limiting
Code changed a lot as CUBLAS supported streams, etc.
Coupled-cluster theory
Coupled-cluster theory
The coupled-cluster (CC) wavefunction ansatz is

$$ |CC\rangle = e^{T} |HF\rangle, \qquad T = T_1 + T_2 + \cdots + T_n. $$

$T$ is an excitation operator which promotes up to $n$ electrons from occupied orbitals to virtual orbitals in the Hartree-Fock Slater determinant.

Inserting $|CC\rangle$ into the Schrödinger equation:

$$ H e^{T} |HF\rangle = E_{CC}\, e^{T} |HF\rangle \qquad \Longleftrightarrow \qquad H|CC\rangle = E_{CC}\,|CC\rangle $$
Coupled-cluster theory
$$ |CC\rangle = \exp(T)\,|0\rangle, \qquad T = T_1 + T_2 + \cdots + T_n \quad (n \ll N) $$

$$ T_1 = \sum_{ia} t_i^a\, a_a^\dagger a_i, \qquad T_2 = \sum_{ijab} t_{ij}^{ab}\, a_a^\dagger a_b^\dagger a_j a_i $$

$$ |\Psi_{CCD}\rangle = \exp(T_2)\,|\Psi_{HF}\rangle = (1 + T_2 + T_2^2)\,|\Psi_{HF}\rangle $$

$$ |\Psi_{CCSD}\rangle = \exp(T_1 + T_2)\,|\Psi_{HF}\rangle = (1 + T_1 + \cdots + T_1^4 + T_2 + T_2^2 + T_1 T_2 + T_1^2 T_2)\,|\Psi_{HF}\rangle $$
Coupled-cluster theory
Projective solution of CC:

$$ E_{CC} = \langle HF| e^{-T} H e^{T} |HF\rangle, \qquad 0 = \langle X| e^{-T} H e^{T} |HF\rangle \quad (X = S, D, \ldots) $$

CCD is:

$$ E_{CC} = \langle HF| e^{-T_2} H e^{T_2} |HF\rangle, \qquad 0 = \langle D| e^{-T_2} H e^{T_2} |HF\rangle $$

CCSD is:

$$ E_{CC} = \langle HF| e^{-T_1-T_2} H e^{T_1+T_2} |HF\rangle, \qquad 0 = \langle S| e^{-T_1-T_2} H e^{T_1+T_2} |HF\rangle, \qquad 0 = \langle D| e^{-T_1-T_2} H e^{T_1+T_2} |HF\rangle $$
Notation
$$ H = H_1 + H_2 = F + V $$

$F$ is the Fock matrix. CC only uses its diagonal in the canonical formulation.

$V$ is the fluctuation operator, composed of the two-electron integrals stored as a 4D array.

$V$ has 8-fold permutation symmetry in $V^{rs}_{pq}$ and is divided into six blocks: $V^{kl}_{ij}$, $V^{ka}_{ij}$, $V^{jb}_{ia}$, $V^{ab}_{ij}$, $V^{bc}_{ia}$, $V^{cd}_{ab}$.

Indices $i, j, k, \ldots$ ($a, b, c, \ldots$) run over the occupied (virtual) orbitals.
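As an illustration of the 4D storage only (the actual NWChem/TCE tile layout differs; the macro and sizes here are hypothetical):

#include <stdlib.h>

/* Illustrative only: one possible linearization of the V^{ab}_{ij} block,
   with a,b running over v virtual and i,j over o occupied orbitals. */
#define V_ABIJ(buf, a, b, i, j, o, v) \
    ((buf)[(((size_t)(a) * (v) + (b)) * (o) + (i)) * (o) + (j)])

int main(void) {
    int o = 4, v = 16;                        /* small example sizes */
    double *v_abij = calloc((size_t)v * v * o * o, sizeof(double));
    V_ABIJ(v_abij, 1, 2, 0, 3, o, v) = 0.5;   /* set one element */
    free(v_abij);
    return 0;
}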
CCD Equations
$$
R^{ab}_{ij} = V^{ab}_{ij} + P(ia,jb)\Big[ T^{ae}_{ij} I^{b}_{e} - T^{ab}_{im} I^{m}_{j}
  + \tfrac{1}{2} V^{ab}_{ef} T^{ef}_{ij} + \tfrac{1}{2} T^{ab}_{mn} I^{mn}_{ij}
  - T^{ae}_{mj} I^{mb}_{ie} - I^{ma}_{ie} T^{eb}_{mj}
  + (2 T^{ea}_{mi} - T^{ea}_{im}) I^{mb}_{ej} \Big]
$$

$$ I^{a}_{b} = (-2 V^{mn}_{eb} + V^{mn}_{be})\, T^{ea}_{mn} $$
$$ I^{i}_{j} = (2 V^{mi}_{ef} - V^{im}_{ef})\, T^{ef}_{mj} $$
$$ I^{ij}_{kl} = V^{ij}_{kl} + V^{ij}_{ef}\, T^{ef}_{kl} $$
$$ I^{ia}_{jb} = V^{ia}_{jb} - \tfrac{1}{2} V^{im}_{eb}\, T^{ea}_{jm} $$
$$ I^{ia}_{bj} = V^{ia}_{bj} + V^{im}_{be}\,(T^{ea}_{mj} - \tfrac{1}{2} T^{ae}_{mj}) - \tfrac{1}{2} V^{mi}_{be}\, T^{ae}_{mj} $$
Turning CC into GEMM 1
Some tensor contractions are trivially mapped to GEMM:

$$ I^{ij}_{kl} \mathrel{+}= V^{ij}_{ef} T^{ef}_{kl}
   \;\to\; I_{(ij)(kl)} \mathrel{+}= V_{(ij)(ef)} T_{(ef)(kl)}
   \;\to\; I^{b}_{a} \mathrel{+}= V^{b}_{c} T^{c}_{a} $$

Other contractions require reordering to use BLAS:

$$ I^{ia}_{bj} \mathrel{+}= V^{im}_{be} T^{ea}_{mj}
   \;\to\; I_{bj,ia} \mathrel{+}= V_{be,im} T_{mj,ea}
   \;\to\; J_{bi,ja} \mathrel{+}= W_{bi,me} U_{me,ja} $$

$$ J^{ja}_{bi} \mathrel{+}= W^{me}_{bi} U^{ja}_{me}
   \;\to\; J_{(ja)(bi)} \mathrel{+}= W_{(me)(bi)} U_{(ja)(me)}
   \;\to\; J^{z}_{x} \mathrel{+}= W^{y}_{x} U^{z}_{y} $$
Turning CC into GEMM 2
Reordering can take as much time as GEMM. Why?
Routine    flops       mops                pipelined
GEMM       O(mnk)      O(mn + mk + kn)     yes
reorder    0           O(mn + mk + kn)     no
Increased memory bandwidth on the GPU makes reordering less expensive (compare matrix transpose).

(There is a chapter in my thesis with profiling results and more details if anyone cares.)
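For concreteness, a reorder is a pure copy with permuted indices: zero floating-point operations, all memory operations, and nothing for the pipeline to hide behind. An illustrative kernel for the permutation used above:

/* I(bj,ia) += V(be,im) T(mj,ea) requires first permuting V(b,e,i,m)
   into Vr(b,i,e,m) so the contracted pair (e,m) is contiguous.
   Zero flops; entirely bandwidth-bound. */
void reorder_beim_to_biem(int nb, int ne, int ni, int nm,
                          const double *V, double *Vr)
{
    for (int b = 0; b < nb; b++)
        for (int e = 0; e < ne; e++)
            for (int i = 0; i < ni; i++)
                for (int m = 0; m < nm; m++)
                    Vr[((b*ni + i)*ne + e)*nm + m] =
                        V[((b*ne + e)*ni + i)*nm + m];
}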
Hardware Details
                               CPU                  GPU
                         X5550    2 X5550     C1060    C2050
processor speed (MHz)     2660       2660      1300     1150
memory bandwidth (GB/s)     32         64       102      144
memory speed (MHz)        1066       1066       800     1500
ECC available              yes        yes        no      yes
SP peak (GF)              85.1      170.2       933     1030
DP peak (GF)              42.6       83.2        78      515
power usage (W)             95        190       188      238

Note that power consumption is apples-to-oranges, since the CPU figure does not include DRAM, whereas the GPU figure does.
Relative Performance of GEMM
GPU versus SMP CPU (8 threads):
[Figure: SGEMM and DGEMM performance (gigaflop/s) versus matrix rank, X5550 vs. C2050.
 SGEMM maxima: CPU = 156.2 GF, GPU = 717.6 GF.
 DGEMM maxima: CPU = 79.2 GF, GPU = 335.6 GF.]
We expect roughly 4-5 times speedup based upon this evaluation.
Chemistry Details
Molecule     o      v
C8H10       21     63
C10H8       24     72
C10H12      26     78
C12H14      31     93
C14H10      33     99
C14H16      36    108
C20         40    120
C16H18      41    123
C18H12      42    126
C18H20      46    138
6-31G basis set
C1 symmetry
F and V from PSI3.
GPU code now runs from PSI3.
Talk to Eugene about PSI4 (GPL).
CCD Algorithm
CCD Performance Results
Iteration time in seconds (our DP code on C2050, C1060, and X5550; Molpro, TCE, and GAMESS on the X5550):

           C2050   C1060   X5550   Molpro     TCE   GAMESS
C8H10        0.3     0.8     1.3      2.3     5.1      6.2
C10H8        0.5     1.5     2.5      4.8    10.6     12.7
C10H12       0.8     2.5     3.5      7.1    16.2     19.7
C12H14       2.0     7.1    10.0     17.6    42.0     57.7
C14H10       2.7    10.2    13.9     29.9    59.5     78.5
C14H16       4.5    16.7    21.6     41.5    90.2    129.3
C20          8.8    29.9    40.3    103.0   166.3    238.9
C16H18      10.5    35.9    50.2     83.3   190.8    279.5
C18H12      12.7    42.2    50.3    111.8   218.4    329.4
C18H20      20.1    73.0    86.6    157.4   372.1    555.5
CCD Performance Summary
[Figure: bar chart of CCD performance (gigaflop/s) per molecule (C8H10 through C18H20), comparing C1060 SP/DP, C2050 SP/DP, and Xeon X5550 SP/DP.]
Numerical Precision versus Performance
Iteration time in seconds:

             C1060           C2050           X5550
molecule     SP     DP       SP     DP       SP     DP
C8H10       0.2    0.8      0.2    0.3      0.7    1.3
C10H8       0.4    1.5      0.2    0.5      1.3    2.5
C10H12      0.7    2.5      0.4    0.8      2.0    3.5
C12H14      1.8    7.1      1.0    2.0      5.6   10.0
C14H10      2.6   10.2      1.5    2.7      8.4   13.9
C14H16      4.1   16.7      2.4    4.5     12.1   21.6
C20         6.7   29.9      4.1    8.8     22.3   40.3
C16H18      9.0   35.9      5.0   10.5     28.8   50.2
C18H12     10.1   42.2      5.6   12.7     29.4   50.3
C18H20     17.2   73.0     10.1   20.1     47.0   86.6

Almost no work has been done on single or mixed precision in standard CPU packages, even though it is worth 2x.
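As a toy illustration of the mixed-precision idea (not any package's algorithm): converge cheaply in single precision, then refine in double. Here a scalar fixed-point iteration stands in for the CC amplitude iterations:

#include <math.h>
#include <stdio.h>

/* Toy mixed precision: converge x = cos(x) in single precision,
   then polish the result in double precision. */
int main(void)
{
    float xs = 1.0f;
    /* SP phase: cheap iterations, limited to float accuracy */
    for (int i = 0; i < 50; i++) xs = cosf(xs);
    /* DP phase: start from the SP result and finish to full tolerance */
    double x = (double)xs, xo;
    do { xo = x; x = cos(x); } while (fabs(x - xo) > 1e-14);
    printf("fixed point = %.15f\n", x);
    return 0;
}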
CCSD Equations
CCSD Algorithm

Guiding principles:

Too many arrays to fit into GPU memory.
Copy-in every iteration was not a problem in CCD.
Want multi-GPU, mixed CPU-GPU algorithms.

Design:

Persistent buffers, but push all large arrays every iteration.
O(N^6) and O(N^5) terms on the GPU.
O(N^5) and O(N^4) terms on the CPU.
Dynamically schedule some diagrams each iteration to load-balance.
Overlap computation and communication with CUDA streams (CUBLAS is compatible now); a sketch follows.
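A minimal sketch of the overlap pattern using the CUBLAS v2 stream API; the buffer names, sizes, and double-buffering scheme are illustrative, not our production code:

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Stage the operand for the next diagram on one stream while the GEMM
   for the current diagram runs on the other. Partial results go to
   per-stream buffers dJ[0], dJ[1], summed afterwards (not shown).
   Error checks omitted for brevity. */
void pipelined_gemms(cublasHandle_t h, int n, const double *hT[2],
                     double *dT[2], const double *dV, double *dJ[2],
                     int ndiagrams)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    size_t bytes = (size_t)n * n * sizeof(double);
    const double one = 1.0;

    cudaMemcpyAsync(dT[0], hT[0], bytes, cudaMemcpyHostToDevice, s[0]);
    for (int d = 0; d < ndiagrams; d++) {
        int cur = d % 2, nxt = (d + 1) % 2;
        if (d + 1 < ndiagrams)  /* prefetch next operand on the other stream */
            cudaMemcpyAsync(dT[nxt], hT[nxt], bytes,
                            cudaMemcpyHostToDevice, s[nxt]);
        cublasSetStream(h, s[cur]);  /* this GEMM overlaps the copy above */
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, dV, n, dT[cur], n, &one, dJ[cur], n);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}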
Hybrid CCSD
Iteration time in seconds:

           Hybrid     CPU   Molpro   NWChem    PSI3     TCE   GAMESS
C8H10         0.6     1.4      2.4      3.6     7.9     8.4      7.2
C10H8         0.9     2.6      5.1      8.2    17.9    16.8     15.3
C10H12        1.4     4.1      7.2     11.3    23.6    25.2     23.6
C12H14        3.3    11.1     19.0     29.4    54.2    64.4     65.1
C14H10        4.4    15.5     31.0     49.1    61.4    90.7     92.9
C14H16        6.3    24.1     43.1     65.0   103.4   129.2    163.7
C20          10.5    43.2    102.0    175.7   162.6   233.9    277.5
C16H18       10.0    38.9     84.1    117.5   192.4   267.9    345.8
C18H12       14.1    57.1    116.2    178.6   216.4   304.5    380.0
C18H20       22.5    95.9    161.4    216.3   306.9   512.0    641.3
We do at least twice as many flops as Molpro due to CC formalism.
More hybrid CCSD
                                  iteration time (s)     speedup vs. Hybrid
molecule   basis     o     v    Hybrid    CPU  Molpro       CPU    Molpro
CH3OH      aTZ       7   175       2.5    4.5     2.8       1.8       1.1
benzene    aDZ      15   171       5.1   14.7    17.4       2.9       3.4
C2H6SO4    aDZ      23   167       9.0   33.2    31.2       3.7       3.5
C10H12     DZ       26   164      10.7   39.5    56.8       3.7       5.3
C10H12     6-31G    26    78       1.4    4.1     7.2       2.9       5.1

Molpro is optimized for v/o ≫ 1.
Our code doesn't favor any limit except o, v ≫ 1.
Programming Models
Categorizing parallel models
MPI-1: private address space, explicit communication, independent execution.
MPI-RMA: private-ish address space, explicit communication, independent execution.
OpenMP (loops): shared address space, implicit communication, collective execution.
OpenMP (sections): shared address space, implicit communication, semi-independent execution.
Pthreads: shared address space, independent execution.
GA: global view of data (not coherent), explicit communication, independent execution.
PGAS: shared address space, implicit/explicit communication, independent execution.

OpenMP is always implemented on top of Pthreads. PGAS can use MPI and/or Pthreads. (A small example contrasting two of these models follows.)
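To make the taxonomy concrete, here is the same reduction expressed in two of these models; a minimal hybrid sketch for illustration only:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* OpenMP: shared address space, implicit communication */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000; i++)
        local += 1.0 / (i + 1);

    /* MPI-1: private address spaces, explicit communication */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}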
Categorizing models
Execution model: independent/tasks vs. cooperative/data-parallel.
Memory sharing: private by default vs. public by default.
Threading: parallel by default vs. serial by default.
Decision making process
MPI exists in a constructive cycle of ubiquity and optimization.
OpenMP is reaching ubiquity; optimization is debatable.
Pthreads is ubiquitous but invisible (system programmers only).
PGAS portability, especially performance portability, is still emerging.
OpenMP: from 1 to N or N to 1?
OpenMP: from 1 to N or N to 1?
Parallel by default, serializing the unsafe parts:

    #pragma omp parallel
    { /* thread-safe */ }
    #pragma omp single
    { /* thread-unsafe */ }
    #pragma omp parallel for
    { /* threaded loops */ }
    #pragma omp sections
    { /* threaded work */ }

Serial by default, parallelizing the loops:

    { /* thread-unsafe */ }
    #pragma omp parallel for
    { /* threaded loops */ }
    { /* thread-unsafe */ }
    #pragma omp parallel for
    { /* threaded loops */ }
    /* thread-unsafe work */
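A compilable toy contrasting the two structures (illustrative only; the reductions stand in for thread-safe work):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    double sum1 = 0.0, sum2 = 0.0;

    /* Parallel by default: one region, serialize the unsafe part */
    #pragma omp parallel
    {
        #pragma omp for reduction(+:sum1)
        for (int i = 0; i < 1000; i++) sum1 += i;   /* thread-safe */
        #pragma omp single
        printf("partial: %f\n", sum1);              /* thread-unsafe */
    }

    /* Serial by default: parallelize only the loops */
    printf("serial setup\n");                       /* thread-unsafe */
    #pragma omp parallel for reduction(+:sum2)
    for (int i = 0; i < 1000; i++) sum2 += i;       /* threaded loop */
    printf("result: %f\n", sum2);                   /* thread-unsafe */
    return 0;
}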
Additional thoughts

Pthreads is very useful for driving communication.
OpenMP NUMA catastrophe (on Linux).
MPI+SHM vs. Pthreads (the shared-vs.-private question).
Learn from the linear algebra community: DAG scheduling with Pthreads, not OpenMP (a task-pool sketch follows).
– SK+KK TCE task pools in EOMCCSD, MRCC
– our GPU-CC code uses task parallelism throughout
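A minimal Pthreads task pool with a shared counter (the task body is a hypothetical stand-in for a tile contraction):

#include <pthread.h>
#include <stdio.h>

#define NTASKS   64
#define NTHREADS  4

static int counter = 0;                  /* shared task counter */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static int next_task(void)               /* NXTVAL-style: grab next task id */
{
    pthread_mutex_lock(&lock);
    int t = counter++;
    pthread_mutex_unlock(&lock);
    return t;
}

static void *worker(void *arg)
{
    (void)arg;
    for (int t; (t = next_task()) < NTASKS; )
        printf("task %d\n", t);          /* stand-in for a tile contraction */
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(th[i], NULL);
    return 0;
}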
Beyond one node
NWChem, MADNESS, MPQC, GAMESS (?) burn core(s) for communication.
Multi-socket, multi-GPU, multi-rail all gets nasty.
Pthread support for affinity becomes more critical.
The NWChem data-server uses ∼25% of cores with InfiniBand.
Do your GPUs need communication helper threads?
Pragmas
PGI pragmas
Q: Can I call a CUDA kernel function from my PGI-compiled code?

A: PGI is working on the design of a feature to allow you to call kernel functions written in CUDA or PTX or other languages directly from your C or Fortran program. We will announce this feature when it is available.

Simple tensor contractions are done with DGEMM; non-simple ones are redistributed (permuted) into simple ones.

CUBLAS DGEMM with CUDA permutation codes.
Pragma DGEMM with pragma permutation codes.
Pragma code only works with Fortran (C is a terrible language to optimize).
PGI pragmas
#pragma acc region
{
    for (i = 0; i < num; i++)
        for (j = 0; j < num; j++)
            for (k = 0; k < num; k++)
                c[i*1000+j] += a[i*1000+k] * b[k*1000+j];
}
PGI pragmas
42, Accelerator region ignored
44, Accelerator restriction: size of the GPU copy of an array
    depends on values computed in this loop
45, Accelerator restriction: size of the GPU copy of 'c' is unknown
    Accelerator restriction: size of the GPU copy of an array
    depends on values computed in this loop
    Accelerator restriction: one or more arrays have unknown size
46, Accelerator restriction: size of the GPU copy of 'a' is unknown
    Accelerator restriction: size of the GPU copy of 'b' is unknown
    Accelerator restriction: one or more arrays have unknown size
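These restrictions are all about array extents the compiler cannot determine. A hedged sketch of the usual remedy, written here in OpenACC syntax (the standardized descendant of the PGI accelerator model; the bounds are illustrative), is to state the shapes explicitly in data clauses:

// Sketch only: explicit shapes let the compiler generate the copies.
void matmul(const float *a, const float *b, float *c, int num)
{
    #pragma acc parallel loop collapse(2) \
        copyin(a[0:1000*1000], b[0:1000*1000]) copy(c[0:1000*1000])
    for (int i = 0; i < num; i++)
        for (int j = 0; j < num; j++) {
            float t = c[i*1000+j];
            for (int k = 0; k < num; k++)
                t += a[i*1000+k] * b[k*1000+j];
            c[i*1000+j] = t;
        }
}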
PGI: loop-based matrix multiplication
[Figure: matrix multiplication with pragmas; gigaflop/s versus matrix rank (0 to 6000). The GPU KIJ loop ordering peaks below ∼18 gigaflop/s.]
GotoBLAS on my Core2 laptop does 20 GF.
Pragmas
Usage:
Must allow native function calls
Pragmas to generate native object rather than offload
Explicit data movement
Explicit data movement is something scientists can do!
Things compilers cannot do well:
Optimize for cache (Netlib vs. Goto)
Generate network communication (PGAS)
Move data across PCI bus (GPU pragmas)
Prove me wrong...
Acknowledgments
Computational Fellowship (AED)
ANL-MCS Breadboard cluster
ANL-LCRC Fusion cluster
Dirac cluster
Shameless promotion
SAAHPC 2012
What: Symposium on Application Accelerators in HPC
When: July 10-12, 2012
Where: Argonne (Chicago, IL)
TCE on GPUs
NWChem TCE module
Automatically generated code: easy to rewrite (in principle).
Data-parallel over tiles using Global Arrays.
Naive dynamic load-balancing (shared counter).
Standard GA programming model: check out, compute, check in (a minimal sketch follows).
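A minimal sketch of one check-out/compute/check-in step in GA's C interface; the tile bounds, buffer sizes, and the omitted local GEMM are illustrative (the generated TCE code itself is Fortran):

#include <ga.h>

/* One task: check out tiles of T and I, contract locally, check the
   result into J. g_T, g_I, g_J are handles to 2D global arrays;
   lo/hi select one tile of at most 32x32 (so n <= 32 here). */
void do_tile(int g_T, int g_I, int g_J, int lo[2], int hi[2], int n)
{
    double Tc[32*32], Ic[32*32], Jc[32*32];
    double one = 1.0;
    int ld[1] = { n };

    NGA_Get(g_T, lo, hi, Tc, ld);        /* check out */
    NGA_Get(g_I, lo, hi, Ic, ld);
    for (int x = 0; x < n*n; x++) Jc[x] = 0.0;
    /* ... compute Jc += Tc*Ic with a local GEMM (not shown) ... */
    NGA_Acc(g_J, lo, hi, Jc, ld, &one);  /* check in (accumulate) */
}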
TCE CPU algorithm
for (P3,P4,H1,H2) in all (P,P,H,H) tiles:
    if (my turn) and (nonzero symmetry):
        allocate and zero buffer Jc
        for (P5,P6) in all (P,P) tiles:
            if (nonzero symmetry):
                allocate and zero buffer Tc
                get Tb from global T
                reorder Tb into Tc
                allocate and zero buffer Ic
                get Ib from global I
                reorder Ib into Ic
                compute Jc += Tc*Ic
        reorder Jc
        acc Jc onto global J
TCE GPU algorithm
grab from pool and zero buffer pair (Tc,Tg)
grab from pool and zero buffer pair (Ic,Ig)
if (push + gpucompute + pull < cpucompute):
    iget and push Tg from global T
    iget and push Ig from global I
    igpu reorder Tg
    igpu reorder Ig
    gpucompute Jg += Tg*Ig
else:
    iget Tc from global T
    iget Ic from global I
    compute Jc += Tc*Ic
ireorder Jg
reorder Jc
pull and acc Jc += Jg
acc Jc onto global J
Evaluating the GPU algorithm
Tile sizes are not big enough to justify the GPU all the time (bad).
Jg stays on the GPU through each pass (good).
Possibly good overlap of data movement (good).
Can unroll the inner loop for maximum overlap (good), but this doubles the memory required (bad).
Threaded CPU code helps with memory issues, but GA/ARMCI is not thread-safe (funneled or serialized should be okay).

Many of the GPU-oriented optimizations will help with multicore CPUs. A fully rewritten GPU TCE in NWChem will probably coincide with porting to BGQ for this reason.
Partial bibliography
MD on GPUs (of many more)
All major MD packages have or will soon have a GPU implementation.

J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, and K. Schulten. “Accelerating molecular modeling applications with graphics processors.” J. Comp. Chem., 28 (16), 2618–2640, 2007.

J. A. Anderson, C. D. Lorenz, and A. Travesset. “General purpose molecular dynamics simulations fully implemented on graphics processing units.” J. Comp. Phys., 227 (10), 5342–5359, 2008.

M. S. Friedrichs, P. Eastman, V. Vaidyanathan, M. Houston, S. Legrand, A. L. Beberg, D. L. Ensign, C. M. Bruns, and V. S. Pande. “Accelerating molecular dynamic simulation on graphics processing units.” J. Comp. Chem., 30 (6), 864–872, 2009.

R. Yokota, T. Hamada, J. P. Bardhan, M. G. Knepley, and L. A. Barba. “Biomolecular electrostatics simulation by an FMM-based BEM on 512 GPUs.” CoRR, abs/1007.4591, 2010.
DFT on GPUs
K. Yasuda. “Accelerating density functional calculations with graphics processing unit.” J. Chem. Theo. Comp., 4 (8), 1230–1236, 2008.

L. Genovese, M. Ospici, T. Deutsch, J.-F. Mehaut, A. Neelov, and S. Goedecker. “Density functional theory calculation on many-cores hybrid central processing unit-graphic processing unit architectures.” J. Chem. Phys., 131 (3), 034103, 2009.

I. S. Ufimtsev and T. J. Martínez. “Quantum chemistry on graphical processing units. 2. Direct self-consistent-field (SCF) implementation.” J. Chem. Theo. Comp., 5 (4), 1004–1015, 2009; “Quantum chemistry on graphical processing units. 3. Analytical energy gradients, geometry optimization, and first principles molecular dynamics.” J. Chem. Theo. Comp., 5 (10), 2619–2628, 2009.

C. J. Woods, P. Brown, and F. R. Manby. “Multicore parallelization of Kohn-Sham theory.” J. Chem. Theo. Comp., 5 (7), 1776–1784, 2009.

P. Brown, C. J. Woods, S. McIntosh-Smith, and F. R. Manby. “A massively multicore parallelization of the Kohn-Sham energy gradients.” J. Comp. Chem., 31 (10), 2008–2013, 2010.

R. Farber, E. Bylaska, S. Baden, and students. “(Car-Parrinello on GPGPUs).” Work in progress, 2011.
MP2 on GPUs
MP2 using the resolution-of-the-identity approximation involves a few large GEMMs.

L. Vogt, R. Olivares-Amaya, S. Kermes, Y. Shao, C. Amador-Bedolla, and A. Aspuru-Guzik. “Accelerating resolution-of-the-identity second-order Møller-Plesset quantum chemistry calculations with graphical processing units.” J. Phys. Chem. A, 112 (10), 2049–2057, 2008.

R. Olivares-Amaya, M. A. Watson, R. G. Edgar, L. Vogt, Y. Shao, and A. Aspuru-Guzik. “Accelerating correlated quantum chemistry calculations using graphical processing units and a mixed precision matrix multiplication library.” J. Chem. Theo. Comp., 6 (1), 135–144, 2010.

A. Koniges, R. Preissl, J. Kim, D. Eder, A. Fisher, N. Masters, V. Mlaker, S. Ethier, W. Wang, and M. Head-Gordon. “Application acceleration on current and future Cray platforms.” In CUG 2010, Edinburgh, Scotland, May 2010.
QMC on GPUs
QMC is ridiculously parallel at the node level; hence, an implementation of the kernel implies the potential for scaling to many thousands of GPUs.

A. G. Anderson, W. A. Goddard III, and P. Schröder. “Quantum Monte Carlo on graphical processing units.” Comp. Phys. Comm., 177 (3), 298–306, 2007.

A. Gothandaraman, G. D. Peterson, G. Warren, R. J. Hinde, and R. J. Harrison. “FPGA acceleration of a quantum Monte Carlo application.” Par. Comp., 34 (4-5), 278–291, 2008.

K. Esler, J. Kim, L. Shulenburger, and D. Ceperley. “Fully accelerating quantum Monte Carlo simulations of real materials on GPU clusters.” Comp. Sci. Eng., 99 (PrePrints), 2010.
CC on GPUs
Since we started this project, two groups have implemented non-iterative triples corrections to CCSD on GPUs. These procedures involve a few very large GEMMs and a reduction. At least two other groups are working on CC on GPUs but have not reported any results.

M. Melichercík, L. Demovic, P. Neogrády, and M. Pitonák. “Acceleration of CCSD(T) computations using technology of graphical processing unit.” 2010.

W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski. “GPU-based implementations of the regularized CCSD(T) method: applications to strongly correlated systems.” Submitted to J. Chem. Theo. Comp., 2010.

A. E. DePrince and J. R. Hammond. “Coupled-cluster theory on graphics processing units I. The coupled-cluster doubles method.” Submitted to J. Chem. Theo. Comp., 2010.
Gaussian Integrals on GPUs
Quantum chemistry methods spend a lot of time generating matrix elements of operators in a Gaussian basis set. All published implementations are either closed-source (commercial) or the source is unpublished; otherwise we would be using these in our code.

I. S. Ufimtsev and T. J. Martínez. “Quantum chemistry on graphical processing units. 1. Strategies for two-electron integral evaluation.” J. Chem. Theo. Comp., 4 (2), 222–231, 2008.

A. V. Titov, V. V. Kindratenko, I. S. Ufimtsev, and T. J. Martínez. “Generation of kernels for calculating electron repulsion integrals of high angular momentum functions on GPUs – preliminary results.” In Proc. SAAHPC 2010, pages 1–3, 2010.

A. Asadchev, V. Allada, J. Felder, B. M. Bode, M. S. Gordon, and T. L. Windus. “Uncontracted Rys quadrature implementation of up to G functions on graphical processing units.” J. Chem. Theo. Comp., 6 (3), 696–704, 2010.