Page 1
Programming models for quantum many-body methods on multicore and manycore processors

Jeff Hammond¹ and Eugene DePrince²

¹ Argonne  ² Georgia Tech

6 February 2011

Jeff Hammond Electronic Structure Calculation Methods on Accelerators

Page 2

Abstract

The growing diversity in computer processor architectures poses a serious challenge to the computational chemistry community. This talk considers some of the key issues, including disjoint address spaces, non-standard architectures and execution models, and the different APIs required to use them. Specifically, we will describe our experiences in developing coupled-cluster methods for Intel multicore, Intel MIC, NVIDIA Fermi and Blue Gene/Q in both a clean-sheet implementation and NWChem. Of fundamental interest is the development of codes that scale not only within the node but across thousands of nodes; hence, the interaction between the processor and the network will be analyzed in detail.

Page 3

Software Automation

Page 4

The practical TCE – NWChem many-body codes

What does it do?

1. GUI input of a quantum many-body theory, e.g. CCSD.

2. Operator specification of the theory.

3. Apply Wick's theorem to transform operator expressions into array expressions.

4. Transform the input array expression into an operation tree using many types of optimization.

5. Produce a Fortran + Global Arrays + NXTVAL implementation.

The developer can intercept at various stages to modify the theory, algorithm or implementation.

Page 5

The practical TCE – Success stories

First parallel implementation of many (most) CC methods.

First truly generic CC code (not string-based):
{RHF, ROHF, UHF} × CC{SD, SDT, SDTQ} × {T/Λ, EOM, LR/QR}

Most of the largest calculations of their kind employ TCE:
CR-EOMCCSD(T), CCSD-LR α, CCSD-QR β, CCSDT-LR α

Reduces implementation time for new methods from years to hours; TCE codes are easy to verify.

Significant hand-tuning by Karol Kowalski and others at PNNL was required to make TCE run efficiently and scale to 1000 processors and beyond.

Page 6

Before TCE

CCSD/aug-cc-pVDZ – 192 b.f. – days on 1 processor

Benzene is close to the crossover point between small and large.

Page 7

Linear response polarizability

CCSD/Z3POL – 1080 b.f. – 40 hours on 1024 processors

This problem is computationally 20,000 times larger than benzene.

J. Chem. Phys. 129, 226101 (2008).

Page 8

Quadratic response hyperpolarizability

CCSD/d-aug-cc-pVTZ – 812 b.f. – 20 hours on 1024 processors

Lower levels of theory are not reliable for this system.

J. Chem. Phys. 130, 194108 (2009).

Page 9

Charge-transfer excited-states of biomolecules

CR-EOMCCSD(T)/6-31G* – 584 b.f. – 1 hour on 256 cores

Lower levels of theory are wildly incorrect for this system.

Page 10

Excited-state calculation of conjugated arrays

CR-EOMCCSD(T)/6-31+G* – 1096 b.f. – 15 hours on 1024 cores

J. Chem. Phys. 132, 154103 (2010).

Page 11

Summary of TCE module

http://cloc.sourceforge.net v 1.53  T=30.0 s
---------------------------------------------------
Language       files    blank   comment       code
---------------------------------------------------
Fortran 77     11451     1004    115129    2824724
---------------------------------------------------
SUM:           11451     1004    115129    2824724
---------------------------------------------------

Only <25 KLOC are hand-written; ∼100 KLOC is utility code following the TCE data-parallel template.

Page 12

My thesis work

http://cloc.sourceforge.net v 1.53  T=13.0 s
---------------------------------------------------
Language       files    blank   comment       code
---------------------------------------------------
Fortran 77      5757        0     29098     983284
---------------------------------------------------
SUM:            5757        0     29098     983284
---------------------------------------------------

Total does not include ∼1M LOC that was reused (EOM).

The CCSD quadratic response hyperpolarizability was derived, implemented and verified during a two-week trip to PNNL. Over 100 KLOC were “written” in under an hour.

Page 13

Before Automation

(The Benz comes before the Ford)

Page 14

What code are we going to generate?

Legacy design limits options

NVIDIA hardware was changing fast

CUDA Fortran didn’t exist in 2009

Eugene had a loop-based CEPA code in C. . .

Assumed that data motion across the PCI bus would be limiting

Code changed a lot as CUBLAS supported streams, etc.

Page 15

Coupled-cluster theory

Page 16

Coupled-cluster theory

The coupled-cluster (CC) wavefunction ansatz is

|CC⟩ = e^T |HF⟩

where T = T_1 + T_2 + · · · + T_n.

T is an excitation operator; T_n promotes n electrons from occupied orbitals to virtual orbitals in the Hartree-Fock Slater determinant.

Inserting |CC⟩ into the Schrödinger equation:

H e^T |HF⟩ = E_CC e^T |HF⟩   ⟺   H |CC⟩ = E_CC |CC⟩

Page 17

Coupled-cluster theory

|CC⟩ = exp(T)|0⟩,   T = T_1 + T_2 + · · · + T_n   (n ≪ N)

T_1 = Σ_{ia} t^a_i a†_a a_i

T_2 = Σ_{ijab} t^{ab}_{ij} a†_a a†_b a_j a_i

|Ψ_CCD⟩ = exp(T_2)|Ψ_HF⟩ = (1 + T_2 + T_2^2)|Ψ_HF⟩

|Ψ_CCSD⟩ = exp(T_1 + T_2)|Ψ_HF⟩ = (1 + T_1 + · · · + T_1^4 + T_2 + T_2^2 + T_1 T_2 + T_1^2 T_2)|Ψ_HF⟩

Page 18

Coupled-cluster theory

Projective solution of CC:

E_CC = ⟨HF| e^{−T} H e^{T} |HF⟩
0 = ⟨X| e^{−T} H e^{T} |HF⟩   (X = S, D, . . .)

CCD is:

E_CC = ⟨HF| e^{−T_2} H e^{T_2} |HF⟩
0 = ⟨D| e^{−T_2} H e^{T_2} |HF⟩

CCSD is:

E_CC = ⟨HF| e^{−T_1−T_2} H e^{T_1+T_2} |HF⟩
0 = ⟨S| e^{−T_1−T_2} H e^{T_1+T_2} |HF⟩
0 = ⟨D| e^{−T_1−T_2} H e^{T_1+T_2} |HF⟩
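These projective equations close after finitely many terms: because H contains at most two-body operators, the Baker-Campbell-Hausdorff expansion of the similarity-transformed Hamiltonian terminates exactly at the fourfold commutator. As a sketch of this standard identity:

```latex
\bar{H} \equiv e^{-T} H e^{T}
        = H + [H,T] + \frac{1}{2!}[[H,T],T]
        + \frac{1}{3!}[[[H,T],T],T]
        + \frac{1}{4!}[[[[H,T],T],T],T]
```

This is why projecting against ⟨S|, ⟨D|, . . . yields closed polynomial equations in the amplitudes.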

Page 19

Notation

H = H_1 + H_2 = F + V

F is the Fock matrix. CC only uses its diagonal in the canonical formulation.

V is the fluctuation operator and is composed of two-electron integrals as a 4D array.

V has 8-fold permutation symmetry in V^{rs}_{pq} and is divided into six blocks: V^{kl}_{ij}, V^{ka}_{ij}, V^{jb}_{ia}, V^{ab}_{ij}, V^{bc}_{ia}, V^{cd}_{ab}.

Indices i, j, k, . . . (a, b, c, . . .) run over the occupied (virtual) orbitals.

Page 20

CCD Equations

R^{ab}_{ij} = V^{ab}_{ij} + P(ia,jb) [ T^{ae}_{ij} I^b_e − T^{ab}_{im} I^m_j + (1/2) V^{ab}_{ef} T^{ef}_{ij} + (1/2) T^{ab}_{mn} I^{mn}_{ij} − T^{ae}_{mj} I^{mb}_{ie} − I^{ma}_{ie} T^{eb}_{mj} + (2 T^{ea}_{mi} − T^{ea}_{im}) I^{mb}_{ej} ]

I^a_b = (−2 V^{mn}_{eb} + V^{mn}_{be}) T^{ea}_{mn}

I^i_j = (2 V^{mi}_{ef} − V^{im}_{ef}) T^{ef}_{mj}

I^{ij}_{kl} = V^{ij}_{kl} + V^{ij}_{ef} T^{ef}_{kl}

I^{ia}_{jb} = V^{ia}_{jb} − (1/2) V^{im}_{eb} T^{ea}_{jm}

I^{ia}_{bj} = V^{ia}_{bj} + V^{im}_{be} (T^{ea}_{mj} − (1/2) T^{ae}_{mj}) − (1/2) V^{mi}_{be} T^{ae}_{mj}

Page 21

Turning CC into GEMM 1

Some tensor contractions are trivially mapped to GEMM:

I^{ij}_{kl} += V^{ij}_{ef} T^{ef}_{kl}
I_{(ij)(kl)} += V_{(ij)(ef)} T_{(ef)(kl)}
I^b_a += V^b_c T^c_a

Other contractions require reordering to use BLAS:

I^{ia}_{bj} += V^{im}_{be} T^{ea}_{mj}
I_{bj,ia} += V_{be,im} T_{mj,ea}
J_{bi,ja} += W_{bi,me} U_{me,ja}
J^{ja}_{bi} += W^{me}_{bi} U^{ja}_{me}
J_{(ja)(bi)} += W_{(me)(bi)} U_{(ja)(me)}
J^z_x += W^y_x U^z_y
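The reorder-then-GEMM pattern can be made concrete with a toy example. The sketch below (plain Python with made-up dimensions; not code from the talk) evaluates I^{ia}_{bj} += V^{im}_{be} T^{ea}_{mj} twice: once with explicit nested loops, and once by permuting both operands so that the contracted index pair (m,e) lines up, after which the contraction is a single ordinary matrix-matrix product, i.e. a GEMM.

```python
import random

# Toy occupied/virtual dimensions (hypothetical; real codes use tile sizes).
no, nv = 2, 3
random.seed(1)

# V[i][m][b][e]: i,m occupied; b,e virtual.  T[e][a][m][j]: e,a virtual; m,j occupied.
V = [[[[random.random() for e in range(nv)] for b in range(nv)]
      for m in range(no)] for i in range(no)]
T = [[[[random.random() for j in range(no)] for m in range(no)]
      for a in range(nv)] for e in range(nv)]

# Direct contraction: I[i][a][b][j] = sum_{m,e} V[i][m][b][e] * T[e][a][m][j]
I_ref = [[[[sum(V[i][m][b][e] * T[e][a][m][j]
                for m in range(no) for e in range(nv))
            for j in range(no)] for b in range(nv)]
          for a in range(nv)] for i in range(no)]

# Reorder both operands so the contracted pair (m,e) is the shared dimension;
# the contraction then becomes one (b,i) x (m,e) . (m,e) x (j,a) matrix multiply.
rows, K, cols = nv * no, no * nv, no * nv   # (b,i), (m,e), (j,a)
A = [[V[i][m][b][e] for m in range(no) for e in range(nv)]
     for b in range(nv) for i in range(no)]                 # A[(b,i)][(m,e)]
B = [[T[e][a][m][j] for j in range(no) for a in range(nv)]
     for m in range(no) for e in range(nv)]                 # B[(m,e)][(j,a)]
C = [[sum(A[r][k] * B[k][c] for k in range(K)) for c in range(cols)]
     for r in range(rows)]                                  # plain GEMM

# The GEMM result, un-flattened, matches the direct contraction.
for i in range(no):
    for a in range(nv):
        for b in range(nv):
            for j in range(no):
                assert abs(C[b * no + i][j * nv + a] - I_ref[i][a][b][j]) < 1e-12
print("permute+GEMM matches direct contraction")
```

In production codes the permutations are done by tuned reorder kernels and the multiply by {S,D}GEMM; the point here is only that the two paths agree.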

Page 22

Turning CC into GEMM 2

Reordering can take as much time as GEMM. Why?

Routine    flops     mops               pipelined
GEMM       O(mnk)    O(mn + mk + kn)    yes
reorder    0         O(mn + mk + kn)    no

Increased memory bandwidth on the GPU makes reordering less expensive (compare matrix transpose).

(There is a chapter in my thesis with profiling results and moredetails if anyone cares.)

Page 23

Hardware Details

                             CPU               GPU
                         X5550  2×X5550    C1060   C2050
processor speed (MHz)     2660     2660     1300    1150
memory bandwidth (GB/s)     32       64      102     144
memory speed (MHz)        1066     1066      800    1500
ECC available              yes      yes       no     yes
SP peak (GF)              85.1    170.2      933    1030
DP peak (GF)              42.6     83.2       78     515
power usage (W)             95      190      188     238

Note that power consumption is apples-to-oranges since the CPU figure does not include DRAM, whereas the GPU figure does.

Page 24

Relative Performance of GEMM

GPU versus SMP CPU (8 threads):

[Plot: SGEMM performance (gigaflop/s) vs. matrix rank for X5550 and C2050. Maximum: CPU = 156.2 GF, GPU = 717.6 GF.]

[Plot: DGEMM performance (gigaflop/s) vs. matrix rank for X5550 and C2050. Maximum: CPU = 79.2 GF, GPU = 335.6 GF.]

We expect roughly 4-5 times speedup based upon this evaluation.

Page 25

Chemistry Details

Molecule    o    v
C8H10      21   63
C10H8      24   72
C10H12     26   78
C12H14     31   93
C14H10     33   99
C14H16     36  108
C20        40  120
C16H18     41  123
C18H12     42  126
C18H20     46  138

6-31G basis set

C1 symmetry

F and V from PSI3.

GPU code now runs from PSI3.

Talk to Eugene about PSI4 (GPL).

Page 26

CCD Algorithm

Page 27

CCD Performance Results

Iteration time in seconds:

               our DP code             X5550
           C2050  C1060  X5550   Molpro    TCE  GAMESS
C8H10        0.3    0.8    1.3      2.3    5.1     6.2
C10H8        0.5    1.5    2.5      4.8   10.6    12.7
C10H12       0.8    2.5    3.5      7.1   16.2    19.7
C12H14       2.0    7.1   10.0     17.6   42.0    57.7
C14H10       2.7   10.2   13.9     29.9   59.5    78.5
C14H16       4.5   16.7   21.6     41.5   90.2   129.3
C20          8.8   29.9   40.3    103.0  166.3   238.9
C16H18      10.5   35.9   50.2     83.3  190.8   279.5
C18H12      12.7   42.2   50.3    111.8  218.4   329.4
C18H20      20.1   73.0   86.6    157.4  372.1   555.5

Page 28

CCD Performance Summary

[Bar chart: CCD performance (gigaflop/s) by molecule, C8H10 through C18H20, for C1060 SP/DP, C2050 SP/DP and Xeon X5550 SP/DP.]

Page 29

Numerical Precision versus Performance

Iteration time in seconds:

             C1060         C2050         X5550
molecule    SP     DP     SP     DP     SP     DP
C8H10      0.2    0.8    0.2    0.3    0.7    1.3
C10H8      0.4    1.5    0.2    0.5    1.3    2.5
C10H12     0.7    2.5    0.4    0.8    2.0    3.5
C12H14     1.8    7.1    1.0    2.0    5.6   10.0
C14H10     2.6   10.2    1.5    2.7    8.4   13.9
C14H16     4.1   16.7    2.4    4.5   12.1   21.6
C20        6.7   29.9    4.1    8.8   22.3   40.3
C16H18     9.0   35.9    5.0   10.5   28.8   50.2
C18H12    10.1   42.2    5.6   12.7   29.4   50.3
C18H20    17.2   73.0   10.1   20.1   47.0   86.6

Almost no work has been done on single- or mixed-precision in standard CPU packages, even though it is worth 2×.

Page 30

CCSD Equations

Page 31

CCSD Algorithm

Guiding principles:

Too many arrays to fit into GPU memory.

Copy-in every iteration in CCD was not a problem

Want multi-GPU, mixed CPU-GPU algorithms.

Design:

Persistent buffers but push all large arrays every iteration.

O(N^6) and O(N^5) terms on the GPU.

O(N^5) and O(N^4) terms on the CPU.

Dynamically schedule some diagrams each iteration to load-balance.

Overlap computation and communication with CUDA streams(CUBLAS compatible now).

Page 32

Hybrid CCSD

Iteration time in seconds:

          Hybrid    CPU  Molpro  NWChem   PSI3    TCE  GAMESS
C8H10        0.6    1.4     2.4     3.6    7.9    8.4     7.2
C10H8        0.9    2.6     5.1     8.2   17.9   16.8    15.3
C10H12       1.4    4.1     7.2    11.3   23.6   25.2    23.6
C12H14       3.3   11.1    19.0    29.4   54.2   64.4    65.1
C14H10       4.4   15.5    31.0    49.1   61.4   90.7    92.9
C14H16       6.3   24.1    43.1    65.0  103.4  129.2   163.7
C20         10.5   43.2   102.0   175.7  162.6  233.9   277.5
C16H18      10.0   38.9    84.1   117.5  192.4  267.9   345.8
C18H12      14.1   57.1   116.2   178.6  216.4  304.5   380.0
C18H20      22.5   95.9   161.4   216.3  306.9  512.0   641.3

We do at least twice as many flops as Molpro due to CC formalism.

Page 33

More hybrid CCSD

molecule   Basis    o    v   Hybrid    CPU  Molpro   CPU/Hybrid  Molpro/Hybrid
CH3OH      aTZ      7  175      2.5    4.5     2.8          1.8            1.1
benzene    aDZ     15  171      5.1   14.7    17.4          2.9            3.4
C2H6SO4    aDZ     23  167      9.0   33.2    31.2          3.7            3.5
C10H12     DZ      26  164     10.7   39.5    56.8          3.7            5.3
C10H12     6-31G   26   78      1.4    4.1     7.2          2.9            5.1

(Last two columns: speedup of the hybrid code over the CPU code and over Molpro.)

Molpro is optimized for v/o ≫ 1.

Our code doesn't favor any limit except o, v ≫ 1.

Page 34

Programming Models

Page 35

Categorizing parallel models

MPI-1: Private address space, explicit communication, independent execution.

MPI-RMA: Private-ish address space, explicit communication, independent execution.

OpenMP (loops): Shared address space, implicit communication, collective execution.

OpenMP (sections): Shared address space, implicit communication, semi-independent execution.

Pthreads: Shared address space, independent execution.

GA: Global view of data (not coherent), explicit communication, independent execution.

PGAS: Shared address space, implicit/explicit communication, independent execution.

OpenMP is always implemented on top of Pthreads. PGAS can use MPI and/or Pthreads.

Page 36

Categorizing models

Execution model: independent/tasks vs. cooperative/data-parallel

Memory sharing: private by default vs. public by default

Threading: parallel by default vs. serial by default

Page 37

Decision making process

MPI exists in a constructive cycle of ubiquity and optimization.

OpenMP is reaching ubiquity; optimization is debatable.

Pthreads is ubiquitous but invisible (system programmers only).

PGAS portability, especially performance portability, is still emerging.

Page 38

OpenMP: from 1 to N or N to 1?

Page 39

OpenMP: from 1 to N or N to 1?

#pragma omp parallel
{ /* thread-safe */ }
#pragma omp single
{ /* thread-unsafe */ }
#pragma omp parallel for
{ /* threaded loops */ }
#pragma omp sections
{ /* threaded work */ }

versus:

{ /* thread-unsafe */ }
#pragma omp parallel for
{ /* threaded loops */ }
{ /* thread-unsafe */ }
#pragma omp parallel for
{ /* threaded loops */ }
/* thread-unsafe work */
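As a rough analogy for the two structures (in Python threads, since Python has no OpenMP; every name here is illustrative only): "1 to N" forks workers just around each parallel loop, while "N to 1" keeps workers running and funnels the thread-unsafe part through a lock, playing the role of #pragma omp single.

```python
import threading

data = list(range(8))
out = [0] * len(data)

# "1 to N": serial by default, fork workers only around the parallel loop.
def fork_join_square():
    def work(i):
        out[i] = data[i] ** 2          # thread-safe: disjoint elements
    ts = [threading.Thread(target=work, args=(i,)) for i in range(len(data))]
    for t in ts: t.start()
    for t in ts: t.join()              # join = implicit barrier at loop end

# "N to 1": parallel by default; serialize the thread-unsafe part with a lock.
log = []
log_lock = threading.Lock()
def persistent_worker(i):
    out[i] = data[i] ** 2              # parallel part
    with log_lock:                     # serialized, thread-unsafe part
        log.append(i)

fork_join_square()
assert out == [i * i for i in range(8)]

ts = [threading.Thread(target=persistent_worker, args=(i,)) for i in range(8)]
for t in ts: t.start()
for t in ts: t.join()
assert sorted(log) == list(range(8))
print("both structures produce the same result")
```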

Page 40

Additional thoughts

Pthreads very useful for driving communication

OpenMP NUMA catastrophe (on Linux)

MPI+SHM vs. Pthreads (shared vs. private question)

Learn from the linear algebra community: DAG-scheduling with Pthreads, not OpenMP

– SK+KK TCE task pools in EOMCCSD, MRCC
– our GPU-CC code using task parallelism throughout

Page 41

Beyond one node

NWChem, MADNESS, MPQC, GAMESS (?) burn core(s) for communication

Multi-socket, multi-GPU, multi-rail all gets nasty

Pthread support for affinity becomes more critical

NWChem data-server uses ∼ 25% of cores with Infiniband

Do your GPUs need communication helper threads?

Page 42

Pragmas

Page 43

PGI pragmas

Q: Can I call a CUDA kernel function from my PGI-compiled code?

A: PGI is working on the design of a feature to allow you to call kernel functions written in CUDA or PTX or other languages directly from your C or Fortran program. We will announce this feature when it is available.

Simple tensor contractions are done with DGEMM; non-simple ones are reduced to simple ones by redistribution (permutation).

CUBLAS DGEMM with CUDA permutation codes

Pragma DGEMM with pragma permutation codes

Pragma code only works with Fortran (C is a terrible language to optimize).

Page 44

PGI pragmas

#pragma acc region
{
    for (i = 0; i < num; i++)
        for (j = 0; j < num; j++)
            for (k = 0; k < num; k++)
                c[i*1000+j] += ( a[i*1000+k] * b[k*1000+j] );
}

Page 45

PGI pragmas

42, Accelerator region ignored
44, Accelerator restriction: size of the GPU copy of an array
    depends on values computed in this loop
45, Accelerator restriction: size of the GPU copy of 'c' is unknown
    Accelerator restriction: size of the GPU copy of an array
    depends on values computed in this loop
    Accelerator restriction: one or more arrays have unknown size
46, Accelerator restriction: size of the GPU copy of 'a' is unknown
    Accelerator restriction: size of the GPU copy of 'b' is unknown
    Accelerator restriction: one or more arrays have unknown size

Page 46

PGI: loop-based matrix multiplication

[Plot: matrix multiplication with pragmas — performance (gigaflop/s) vs. rank for the GPU KIJ loop ordering; y-axis runs to ~18 GF.]

GotoBLAS on my Core2 laptop does 20 GF.

Page 47

Pragmas

Usage:

Must allow native function calls

Pragmas to generate native object rather than offload

Explicit data movement

Explicit data movement is something scientists can do!

Things compilers cannot do well:

Optimize for cache (Netlib vs. Goto)

Generate network communication (PGAS)

Move data across PCI bus (GPU pragmas)

Prove me wrong. . .

Page 48

Acknowledgments

Computational Fellowship (AED)
ANL-MCS Breadboard cluster
ANL-LCRC Fusion cluster

Dirac cluster

Page 49

Shameless promotion

Page 50

SAAHPC 2012

What: Symposium on Application Accelerators in HPC
When: July 10–12, 2012
Where: Argonne (Chicago, IL)

Page 51

TCE on GPUs

Page 52

NWChem TCE module

Automatically generated code: easy to rewrite (in principle).

Data-parallel over tiles using Global Arrays.

Naive dynamic load-balancing (shared counter).

Standard GA programming model: check out, compute, check in

Page 53

TCE CPU algorithm

for (P3,P4,H1,H2) in all (P,P,H,H) tiles:
    if (my turn) and (nonzero symmetry):
        allocate and zero buffer Jc
        for (P5,P6) in all (P,P) tiles:
            if (nonzero symmetry):
                allocate and zero buffer Tc
                get Tb from global T
                reorder Tc
                allocate and zero buffer Ic
                get Ib from global I
                reorder Ic
                compute Jc += Tc*Ic
        reorder Jc
        acc Jc onto global J
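The "(my turn)" test is TCE's naive dynamic load balancing: a shared NXTVAL counter hands out the next global tile index, and whichever process checks out index t computes that tile. A minimal sketch in Python threads (names and the stand-in "work" are made up; the real code uses a Global Arrays counter across MPI processes):

```python
import threading

# NXTVAL-style shared counter: each worker checks out the next task index.
class Nxtval:
    def __init__(self):
        self._lock = threading.Lock()
        self._next = 0
    def next(self):
        with self._lock:
            t = self._next
            self._next += 1
            return t

ntiles = 100
results = [0.0] * ntiles
counter = Nxtval()

def worker():
    while True:
        t = counter.next()          # the "(my turn)" test from the slide
        if t >= ntiles:
            return
        results[t] = float(t * t)   # stand-in for get/reorder/GEMM/acc

threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads: th.start()
for th in threads: th.join()

# Every tile was computed exactly once, by whichever worker got there first.
assert results == [float(t * t) for t in range(ntiles)]
print("all tiles computed")
```

Faster workers naturally take more tiles, which is why this scheme load-balances without any static partitioning.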

Page 54

TCE GPU algorithm

grab from pool and zero buffer pair (Tc,Tg)
grab from pool and zero buffer pair (Ic,Ig)
if (push + gpucompute + pull < cpucompute):
    iget and push Tg from global T
    iget and push Ig from global I
    igpu reorder Tg
    igpu reorder Ig
    gpucompute Jg += Tg*Ig
else:
    iget Tc from global T
    iget Ic from global I
    compute Jc += Tc*Ic
ireorder Jg
reorder Jc
pull and acc Jc += Jg
acc Jc onto global J

Page 55

Evaluating the GPU algorithm

Tilesizes not big enough to justify GPU all the time (bad).

Jg stays on the GPU through each pass (good).

Possibly good overlap of data movement (good).

Can unroll inner loop for maximum overlap (good), but this doubles the memory required (bad).

Threaded CPU code helps memory issues but GA/ARMCI is not thread-safe (funneled or serialized should be okay).

Many of the GPU-oriented optimizations will help with multicore CPUs. A fully rewritten GPU TCE in NWChem will probably coincide with porting to BGQ for this reason.

Page 56

Partial bibliography

Page 57

MD on GPUs (of many more)

All major MD packages have or will soon have a GPU implementation.

J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco,and K. Schulten. “Accelerating molecular modeling applications withgraphics processors.” J. Comp. Chem., 28 (16), 2618–2640, 2007.

J. A. Anderson, C. D. Lorenz, and A. Travesset. “General purposemolecular dynamics simulations fully implemented on graphics processingunits.” J. Comp. Phys., 227 (10), 5342–5359, 2008.

P. Friedrichs, M. S.and Eastman, V. Vaidyanathan, M. Houston,S. Legrand, A. L. Beberg, D. L. Ensign, C. M. Bruns, and V. S. Pande.“Accelerating molecular dynamic simulation on graphics processingunits.” J. Comp. Chem., 30 (6), 864–872, 2009.

R. Yokota, T. Hamada, J. P. Bardhan, M. G. Knepley, and L. A. Barba. “Biomolecular electrostatics simulation by an FMM-based BEM on 512 GPUs.” CoRR, abs/1007.4591, 2010.


Page 58

DFT on GPUs

K. Yasuda. “Accelerating density functional calculations with graphics processing unit.” J. Chem. Theo. Comp., 4 (8), 1230–1236, 2008.

L. Genovese, M. Ospici, T. Deutsch, J.-F. Mehaut, A. Neelov, and S. Goedecker. “Density functional theory calculation on many-cores hybrid central processing unit-graphic processing unit architectures.” J. Chem. Phys., 131 (3), 034103, 2009.

I. S. Ufimtsev and T. J. Martínez. “Quantum chemistry on graphical processing units. 2. Direct self-consistent-field (SCF) implementation.” J. Chem. Theo. Comp., 5 (4), 1004–1015, 2009; “Quantum chemistry on graphical processing units. 3. Analytical energy gradients, geometry optimization, and first principles molecular dynamics.” J. Chem. Theo. Comp., 5 (10), 2619–2628, 2009.

C. J. Woods, P. Brown, and F. R. Manby. “Multicore parallelization of Kohn-Sham theory.” J. Chem. Theo. Comp., 5 (7), 1776–1784, 2009.

P. Brown, C. J. Woods, S. McIntosh-Smith, and F. R. Manby. “A massively multicore parallelization of the Kohn-Sham energy gradients.” J. Comp. Chem., 31 (10), 2008–2013, 2010.

R. Farmber, E. Bylaska, S. Baden, and students. “(Car-Parrinello on GPGPUs).” Work in progress, 2011.


Page 59

MP2 on GPUs

MP2 using the resolution-of-the-identity (RI) approximation involves only a few large GEMMs.
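The claim above can be sketched in a few lines. With three-index RI factors B, the integral build (ia|jb) = Σ_Q B[Q,ia] B[Q,jb] is one large GEMM, and the MP2 energy is a reduction over its output. All dimensions and data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
no, nv, naux = 4, 6, 20    # occupied, virtual, auxiliary dimensions (made up)

# Three-index RI factors B[Q, ia]; random data stands in for real integrals.
B = 0.01 * rng.standard_normal((naux, no * nv))

# The "few large GEMMs": (ia|jb) = sum_Q B[Q, ia] * B[Q, jb]
iajb = (B.T @ B).reshape(no, nv, no, nv)

# Synthetic but physically-signed orbital energies: e_occ < 0 < e_virt.
eo = -1.0 - rng.random(no)
ev = 1.0 + rng.random(nv)
denom = (eo[:, None, None, None] - ev[None, :, None, None]
         + eo[None, None, :, None] - ev[None, None, None, :])

# The energy is a reduction over the GEMM output:
# E_MP2 = sum (ia|jb) [2 (ia|jb) - (ib|ja)] / (e_i - e_a + e_j - e_b)
e_mp2 = np.sum(iajb * (2.0 * iajb - iajb.transpose(0, 3, 2, 1)) / denom)
```

Because the hot spot is a single (ov × naux) by (naux × ov) product, RI-MP2 maps naturally onto a vendor GEMM with little restructuring.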

L. Vogt, R. Olivares-Amaya, S. Kermes, Y. Shao, C. Amador-Bedolla, and A. Aspuru-Guzik. “Accelerating resolution-of-the-identity second-order Møller-Plesset quantum chemistry calculations with graphical processing units.” J. Phys. Chem. A, 112 (10), 2049–2057, 2008.

R. Olivares-Amaya, M. A. Watson, R. G. Edgar, L. Vogt, Y. Shao, and A. Aspuru-Guzik. “Accelerating correlated quantum chemistry calculations using graphical processing units and a mixed precision matrix multiplication library.” J. Chem. Theo. Comp., 6 (1), 135–144, 2010.

A. Koniges, R. Preissl, J. Kim, D. Eder, A. Fisher, N. Masters, V. Mlaker, S. Ethier, W. Wang, and M. Head-Gordon. “Application acceleration on current and future Cray platforms.” In CUG 2010, Edinburgh, Scotland, May 2010.


Page 60

QMC on GPUs

QMC is ridiculously parallel at the node level; hence, implementing the single-node kernel implies the potential for scaling to many thousands of GPUs.
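A toy sketch of why this is so: walkers are statistically independent, so each device only needs to return a partial sum and a count, and combining results is trivial. The "local energy" samples below are synthetic:

```python
import random

def run_walkers(n_walkers, seed):
    """Toy 'local energy' samples from independent walkers.
    Gaussian noise around -1.0 stands in for real QMC estimates."""
    rng = random.Random(seed)
    return [rng.gauss(-1.0, 0.1) for _ in range(n_walkers)]

# Partition walkers across (pretend) devices; each runs independently,
# with no communication until the final average.
shards = [run_walkers(1000, seed) for seed in range(4)]

# Combining only requires a sum and a count from each device.
total = sum(sum(s) for s in shards)
count = sum(len(s) for s in shards)
mean_energy = total / count
```

The combined average is exactly the average over all walkers, so nothing in the algorithm itself limits the device count.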

A. G. Anderson, W. A. Goddard III, and P. Schroder. “Quantum Monte Carlo on graphical processing units.” Comp. Phys. Comm., 177 (3), 298–306, 2007.

A. Gothandaraman, G. D. Peterson, G. Warren, R. J. Hinde, and R. J. Harrison. “FPGA acceleration of a quantum Monte Carlo application.” Par. Comp., 34 (4-5), 278–291, 2008.

K. Esler, J. Kim, L. Shulenburger, and D. Ceperley. “Fully accelerating quantum Monte Carlo simulations of real materials on GPU clusters.” Comp. Sci. Eng., 99 (PrePrints), 2010.


Page 61

CC on GPUs

Since we started this project, two groups have implemented non-iterative triples corrections to CCSD on GPUs. These procedures involve a few very large GEMMs and a reduction. At least two other groups are working on CC on GPUs but have not yet reported results.
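The computational pattern (not the actual (T) working equations, which sum several such contractions over permutations) can be sketched with synthetic tensors: one large flattened GEMM builds an intermediate, and a reduction against energy denominators yields a scalar:

```python
import numpy as np

rng = np.random.default_rng(1)
nv, no = 8, 4    # virtual and occupied dimensions (made up)

# Synthetic doubles amplitudes and integrals, flattened so the
# contraction is a single large GEMM, as described on the slide.
t2 = 0.01 * rng.standard_normal((no * no, nv * nv))   # t[ij, ab]
v = rng.standard_normal((nv * nv, nv))                # v[ab, c]

# One large GEMM: W[ij, c] = sum_ab t[ij, ab] * v[ab, c]
W = t2 @ v

# Reduction: an energy-like scalar from W against a (synthetic,
# negative) denominator, mimicking the perturbative correction.
D = -(1.0 + rng.random(W.shape))
e_t = np.sum(W * W / D)
```

As with RI-MP2, the flop count is dominated by the GEMM, which is why the non-iterative triples correction was an early target for GPU acceleration.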

M. Melicherčík, L. Demovič, P. Neogrády, and M. Pitoňák. “Acceleration of CCSD(T) computations using technology of graphical processing unit.” 2010.

W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski. “GPU-based implementations of the regularized CCSD(T) method: applications to strongly correlated systems.” Submitted to J. Chem. Theo. Comp., 2010.

A. E. DePrince and J. R. Hammond. “Coupled-cluster theory on graphics processing units I. The coupled-cluster doubles method.” Submitted to J. Chem. Theo. Comp., 2010.


Page 62

Gaussian Integrals on GPUs

Quantum chemistry methods spend a lot of time generating matrix elements of operators in a Gaussian basis set. All published implementations are either closed-source (commercial) or the source is unpublished; otherwise we would be using them in our code.

I. S. Ufimtsev and T. J. Martínez. “Quantum chemistry on graphical processing units. 1. Strategies for two-electron integral evaluation.” J. Chem. Theo. Comp., 4 (2), 222–231, 2008.

A. V. Titov, V. V. Kindratenko, I. S. Ufimtsev, and T. J. Martínez. “Generation of kernels for calculating electron repulsion integrals of high angular momentum functions on GPUs – preliminary results.” In Proc. SAAHPC 2010, pages 1–3, 2010.

A. Asadchev, V. Allada, J. Felder, B. M. Bode, M. S. Gordon, and T. L. Windus. “Uncontracted Rys quadrature implementation of up to G functions on graphical processing units.” J. Chem. Theo. Comp., 6 (3), 696–704, 2010.
