Page 1
Programming models for quantum many-body methods on multicore and manycore processors

Jeff Hammond¹ and Eugene DePrince²

¹ Argonne  ² Georgia Tech

6 February 2011

Jeff Hammond Electronic Structure Calculation Methods on Accelerators

Page 2

Abstract

The growing diversity in computer processor architectures poses a serious challenge to the computational chemistry community. This talk considers some of the key issues, including disjoint address spaces, non-standard architectures and execution models, and the different APIs required to use them. Specifically, we will describe our experiences in developing coupled-cluster methods for Intel multicore, Intel MIC, NVIDIA Fermi and Blue Gene/Q in both a clean-sheet implementation and NWChem. Of fundamental interest is the development of codes that scale not only within the node but across thousands of nodes; hence, the interaction between the processor and the network will be analyzed in detail.

Page 3

Software Automation

Page 4

The practical TCE – NWChem many-body codes

What does it do?

1. GUI input of a quantum many-body theory, e.g. CCSD.

2. Operator specification of the theory.

3. Apply Wick's theorem to transform operator expressions into array expressions.

4. Transform the input array expression into an operation tree using many types of optimization.

5. Produce a Fortran + Global Arrays + NXTVAL implementation.

The developer can intercept at various stages to modify the theory, algorithm or implementation.

Page 5

The practical TCE – Success stories

First parallel implementation of many (most) CC methods.

First truly generic CC code (not string-based):
{RHF, ROHF, UHF} × CC{SD, SDT, SDTQ} × {T/Λ, EOM, LR/QR}

Most of the largest calculations of their kind employ TCE:
CR-EOMCCSD(T), CCSD-LR α, CCSD-QR β, CCSDT-LR α

Reduces implementation time for new methods from years to hours; TCE codes are easy to verify.

Significant hand-tuning by Karol Kowalski and others at PNNL was required to make TCE run efficiently and scale to 1000 processors and beyond.

Page 6

Before TCE

CCSD/aug-cc-pVDZ – 192 b.f. – days on 1 processor

Benzene is close to the crossover point between small and large.

Page 7

Linear response polarizability

CCSD/Z3POL – 1080 b.f. – 40 hours on 1024 processors

This problem is computationally 20,000 times larger than benzene.

J. Chem. Phys. 129, 226101 (2008).

Page 8

Quadratic response hyperpolarizability

CCSD/d-aug-cc-pVTZ – 812 b.f. – 20 hours on 1024 processors

Lower levels of theory are not reliable for this system.

J. Chem. Phys. 130, 194108 (2009).

Page 9

Charge-transfer excited-states of biomolecules

CR-EOMCCSD(T)/6-31G* – 584 b.f. – 1 hour on 256 cores

Lower levels of theory are wildly incorrect for this system.

Page 10

Excited-state calculation of conjugated arrays

CR-EOMCCSD(T)/6-31+G* – 1096 b.f. – 15 hours on 1024 cores

J. Chem. Phys. 132, 154103 (2010).

Page 11

Summary of TCE module

http://cloc.sourceforge.net v 1.53  T=30.0 s
---------------------------------------------------
Language       files    blank   comment       code
---------------------------------------------------
Fortran 77     11451     1004    115129    2824724
---------------------------------------------------
SUM:           11451     1004    115129    2824724
---------------------------------------------------

Only <25 KLOC are hand-written; ∼100 KLOC is utility code following the TCE data-parallel template.

Page 12

My thesis work

http://cloc.sourceforge.net v 1.53  T=13.0 s
---------------------------------------------------
Language       files    blank   comment       code
---------------------------------------------------
Fortran 77      5757        0     29098     983284
---------------------------------------------------
SUM:            5757        0     29098     983284
---------------------------------------------------

Total does not include ∼1M LOC that was reused (EOM).

The CCSD quadratic response hyperpolarizability was derived, implemented and verified during a two-week trip to PNNL. Over 100 KLOC were “written” in under an hour.

Page 13

Before Automation

(The Benz comes before the Ford)

Page 14

What code are we going to generate?

Legacy design limits options

NVIDIA hardware was changing fast

CUDA Fortran didn’t exist in 2009

Eugene had a loop-based CEPA code in C. . .

Assumed that data motion across the PCI bus would be limiting

Code changed a lot as CUBLAS supported streams, etc.

Page 15

Coupled-cluster theory

Page 16

Coupled-cluster theory

The coupled-cluster (CC) wavefunction ansatz is

|CC⟩ = e^T |HF⟩

where T = T_1 + T_2 + · · · + T_n.

T is an excitation operator; T_n promotes n electrons from occupied orbitals to virtual orbitals in the Hartree-Fock Slater determinant.

Inserting |CC⟩ into the Schrödinger equation:

H e^T |HF⟩ = E_CC e^T |HF⟩   ⟺   H |CC⟩ = E_CC |CC⟩

Page 17

Coupled-cluster theory

|CC⟩ = exp(T)|0⟩,   T = T_1 + T_2 + · · · + T_n   (n ≪ N)

T_1 = Σ_{ia} t^a_i a†_a a_i

T_2 = Σ_{ijab} t^{ab}_{ij} a†_a a†_b a_j a_i

|Ψ_CCD⟩ = exp(T_2)|Ψ_HF⟩ = (1 + T_2 + T_2^2)|Ψ_HF⟩

|Ψ_CCSD⟩ = exp(T_1 + T_2)|Ψ_HF⟩ = (1 + T_1 + · · · + T_1^4 + T_2 + T_2^2 + T_1 T_2 + T_1^2 T_2)|Ψ_HF⟩

Page 18

Coupled-cluster theory

Projective solution of CC:

E_CC = ⟨HF| e^{−T} H e^{T} |HF⟩
0 = ⟨X| e^{−T} H e^{T} |HF⟩   (X = S, D, . . .)

CCD is:

E_CC = ⟨HF| e^{−T_2} H e^{T_2} |HF⟩
0 = ⟨D| e^{−T_2} H e^{T_2} |HF⟩

CCSD is:

E_CC = ⟨HF| e^{−T_1−T_2} H e^{T_1+T_2} |HF⟩
0 = ⟨S| e^{−T_1−T_2} H e^{T_1+T_2} |HF⟩
0 = ⟨D| e^{−T_1−T_2} H e^{T_1+T_2} |HF⟩
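These projective equations close after finitely many terms: because H contains at most two-body operators, the Baker-Campbell-Hausdorff expansion of the similarity-transformed Hamiltonian terminates exactly at the fourfold commutator. As a sketch of this standard identity:

```latex
\bar{H} \equiv e^{-T} H e^{T}
        = H + [H,T] + \frac{1}{2!}[[H,T],T]
        + \frac{1}{3!}[[[H,T],T],T]
        + \frac{1}{4!}[[[[H,T],T],T],T]
```

This is why projecting against ⟨S|, ⟨D|, . . . yields closed polynomial equations in the amplitudes.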

Page 19

Notation

H = H_1 + H_2 = F + V

F is the Fock matrix. CC only uses its diagonal in the canonical formulation.

V is the fluctuation operator and is composed of two-electron integrals as a 4D array.

V has 8-fold permutation symmetry in V^{rs}_{pq} and is divided into six blocks: V^{kl}_{ij}, V^{ka}_{ij}, V^{jb}_{ia}, V^{ab}_{ij}, V^{bc}_{ia}, V^{cd}_{ab}.

Indices i, j, k, . . . (a, b, c, . . .) run over the occupied (virtual) orbitals.

Page 20

CCD Equations

R^{ab}_{ij} = V^{ab}_{ij} + P(ia,jb) [ T^{ae}_{ij} I^b_e − T^{ab}_{im} I^m_j + (1/2) V^{ab}_{ef} T^{ef}_{ij} + (1/2) T^{ab}_{mn} I^{mn}_{ij} − T^{ae}_{mj} I^{mb}_{ie} − I^{ma}_{ie} T^{eb}_{mj} + (2 T^{ea}_{mi} − T^{ea}_{im}) I^{mb}_{ej} ]

I^a_b = (−2 V^{mn}_{eb} + V^{mn}_{be}) T^{ea}_{mn}

I^i_j = (2 V^{mi}_{ef} − V^{im}_{ef}) T^{ef}_{mj}

I^{ij}_{kl} = V^{ij}_{kl} + V^{ij}_{ef} T^{ef}_{kl}

I^{ia}_{jb} = V^{ia}_{jb} − (1/2) V^{im}_{eb} T^{ea}_{jm}

I^{ia}_{bj} = V^{ia}_{bj} + V^{im}_{be} (T^{ea}_{mj} − (1/2) T^{ae}_{mj}) − (1/2) V^{mi}_{be} T^{ae}_{mj}

Page 21

Turning CC into GEMM 1

Some tensor contractions are trivially mapped to GEMM:

I^{ij}_{kl} += V^{ij}_{ef} T^{ef}_{kl}
I_{(ij)(kl)} += V_{(ij)(ef)} T_{(ef)(kl)}
I^b_a += V^b_c T^c_a

Other contractions require reordering to use BLAS:

I^{ia}_{bj} += V^{im}_{be} T^{ea}_{mj}
I_{bj,ia} += V_{be,im} T_{mj,ea}
J_{bi,ja} += W_{bi,me} U_{me,ja}
J^{ja}_{bi} += W^{me}_{bi} U^{ja}_{me}
J_{(ja)(bi)} += W_{(me)(bi)} U_{(ja)(me)}
J^z_x += W^y_x U^z_y
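The reorder-then-GEMM pattern can be made concrete with a toy example. The sketch below (plain Python with made-up dimensions; not code from the talk) evaluates I^{ia}_{bj} += V^{im}_{be} T^{ea}_{mj} twice: once with explicit nested loops, and once by permuting both operands so that the contracted index pair (m,e) lines up, after which the contraction is a single ordinary matrix-matrix product, i.e. a GEMM.

```python
import random

# Toy occupied/virtual dimensions (hypothetical; real codes use tile sizes).
no, nv = 2, 3
random.seed(1)

# V[i][m][b][e]: i,m occupied; b,e virtual.  T[e][a][m][j]: e,a virtual; m,j occupied.
V = [[[[random.random() for e in range(nv)] for b in range(nv)]
      for m in range(no)] for i in range(no)]
T = [[[[random.random() for j in range(no)] for m in range(no)]
      for a in range(nv)] for e in range(nv)]

# Direct contraction: I[i][a][b][j] = sum_{m,e} V[i][m][b][e] * T[e][a][m][j]
I_ref = [[[[sum(V[i][m][b][e] * T[e][a][m][j]
                for m in range(no) for e in range(nv))
            for j in range(no)] for b in range(nv)]
          for a in range(nv)] for i in range(no)]

# Reorder both operands so the contracted pair (m,e) is the shared dimension;
# the contraction then becomes one (b,i) x (m,e) . (m,e) x (j,a) matrix multiply.
rows, K, cols = nv * no, no * nv, no * nv   # (b,i), (m,e), (j,a)
A = [[V[i][m][b][e] for m in range(no) for e in range(nv)]
     for b in range(nv) for i in range(no)]                 # A[(b,i)][(m,e)]
B = [[T[e][a][m][j] for j in range(no) for a in range(nv)]
     for m in range(no) for e in range(nv)]                 # B[(m,e)][(j,a)]
C = [[sum(A[r][k] * B[k][c] for k in range(K)) for c in range(cols)]
     for r in range(rows)]                                  # plain GEMM

# The GEMM result, un-flattened, matches the direct contraction.
for i in range(no):
    for a in range(nv):
        for b in range(nv):
            for j in range(no):
                assert abs(C[b * no + i][j * nv + a] - I_ref[i][a][b][j]) < 1e-12
print("permute+GEMM matches direct contraction")
```

In production codes the permutations are done by tuned reorder kernels and the multiply by {S,D}GEMM; the point here is only that the two paths agree.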

Page 22

Turning CC into GEMM 2

Reordering can take as much time as GEMM. Why?

Routine    flops     mops               pipelined
GEMM       O(mnk)    O(mn + mk + kn)    yes
reorder    0         O(mn + mk + kn)    no

Increased memory bandwidth on the GPU makes reordering less expensive (compare matrix transpose).

(There is a chapter in my thesis with profiling results and moredetails if anyone cares.)

Page 23

Hardware Details

                             CPU               GPU
                         X5550  2×X5550    C1060   C2050
processor speed (MHz)     2660     2660     1300    1150
memory bandwidth (GB/s)     32       64      102     144
memory speed (MHz)        1066     1066      800    1500
ECC available              yes      yes       no     yes
SP peak (GF)              85.1    170.2      933    1030
DP peak (GF)              42.6     83.2       78     515
power usage (W)             95      190      188     238

Note that power consumption is apples-to-oranges since the CPU figure does not include DRAM, whereas the GPU figure does.

Page 24

Relative Performance of GEMM

GPU versus SMP CPU (8 threads):

[Plot: SGEMM performance (gigaflop/s) vs. matrix rank for X5550 and C2050. Maximum: CPU = 156.2 GF, GPU = 717.6 GF.]

[Plot: DGEMM performance (gigaflop/s) vs. matrix rank for X5550 and C2050. Maximum: CPU = 79.2 GF, GPU = 335.6 GF.]

We expect roughly 4-5 times speedup based upon this evaluation.

Page 25

Chemistry Details

Molecule    o    v
C8H10      21   63
C10H8      24   72
C10H12     26   78
C12H14     31   93
C14H10     33   99
C14H16     36  108
C20        40  120
C16H18     41  123
C18H12     42  126
C18H20     46  138

6-31G basis set

C1 symmetry

F and V from PSI3.

GPU code now runs from PSI3.

Talk to Eugene about PSI4 (GPL).

Page 26

CCD Algorithm

Page 27

CCD Performance Results

Iteration time in seconds:

               our DP code             X5550
           C2050  C1060  X5550   Molpro    TCE  GAMESS
C8H10        0.3    0.8    1.3      2.3    5.1     6.2
C10H8        0.5    1.5    2.5      4.8   10.6    12.7
C10H12       0.8    2.5    3.5      7.1   16.2    19.7
C12H14       2.0    7.1   10.0     17.6   42.0    57.7
C14H10       2.7   10.2   13.9     29.9   59.5    78.5
C14H16       4.5   16.7   21.6     41.5   90.2   129.3
C20          8.8   29.9   40.3    103.0  166.3   238.9
C16H18      10.5   35.9   50.2     83.3  190.8   279.5
C18H12      12.7   42.2   50.3    111.8  218.4   329.4
C18H20      20.1   73.0   86.6    157.4  372.1   555.5

Page 28

CCD Performance Summary

[Bar chart: CCD performance (gigaflop/s) by molecule, C8H10 through C18H20, for C1060 SP/DP, C2050 SP/DP and Xeon X5550 SP/DP.]

Page 29

Numerical Precision versus Performance

Iteration time in seconds:

             C1060         C2050         X5550
molecule    SP     DP     SP     DP     SP     DP
C8H10      0.2    0.8    0.2    0.3    0.7    1.3
C10H8      0.4    1.5    0.2    0.5    1.3    2.5
C10H12     0.7    2.5    0.4    0.8    2.0    3.5
C12H14     1.8    7.1    1.0    2.0    5.6   10.0
C14H10     2.6   10.2    1.5    2.7    8.4   13.9
C14H16     4.1   16.7    2.4    4.5   12.1   21.6
C20        6.7   29.9    4.1    8.8   22.3   40.3
C16H18     9.0   35.9    5.0   10.5   28.8   50.2
C18H12    10.1   42.2    5.6   12.7   29.4   50.3
C18H20    17.2   73.0   10.1   20.1   47.0   86.6

Almost no work has been done on single- or mixed-precision in standard CPU packages, even though it is worth 2×.

Page 30

CCSD Equations

Page 31

CCSD Algorithm

Guiding principles:

Too many arrays to fit into GPU memory.

Copy-in every iteration in CCD was not a problem

Want multi-GPU, mixed CPU-GPU algorithms.

Design:

Persistent buffers but push all large arrays every iteration.

O(N^6) and O(N^5) terms on the GPU.

O(N^5) and O(N^4) terms on the CPU.

Dynamically schedule some diagrams each iteration to load-balance.

Overlap computation and communication with CUDA streams(CUBLAS compatible now).

Page 32

Hybrid CCSD

Iteration time in seconds:

          Hybrid    CPU  Molpro  NWChem   PSI3    TCE  GAMESS
C8H10        0.6    1.4     2.4     3.6    7.9    8.4     7.2
C10H8        0.9    2.6     5.1     8.2   17.9   16.8    15.3
C10H12       1.4    4.1     7.2    11.3   23.6   25.2    23.6
C12H14       3.3   11.1    19.0    29.4   54.2   64.4    65.1
C14H10       4.4   15.5    31.0    49.1   61.4   90.7    92.9
C14H16       6.3   24.1    43.1    65.0  103.4  129.2   163.7
C20         10.5   43.2   102.0   175.7  162.6  233.9   277.5
C16H18      10.0   38.9    84.1   117.5  192.4  267.9   345.8
C18H12      14.1   57.1   116.2   178.6  216.4  304.5   380.0
C18H20      22.5   95.9   161.4   216.3  306.9  512.0   641.3

We do at least twice as many flops as Molpro due to CC formalism.

Page 33

More hybrid CCSD

molecule   Basis    o    v   Hybrid    CPU  Molpro   CPU/Hybrid  Molpro/Hybrid
CH3OH      aTZ      7  175      2.5    4.5     2.8          1.8            1.1
benzene    aDZ     15  171      5.1   14.7    17.4          2.9            3.4
C2H6SO4    aDZ     23  167      9.0   33.2    31.2          3.7            3.5
C10H12     DZ      26  164     10.7   39.5    56.8          3.7            5.3
C10H12     6-31G   26   78      1.4    4.1     7.2          2.9            5.1

(Last two columns: speedup of the hybrid code over the CPU code and over Molpro.)

Molpro is optimized for v/o ≫ 1.

Our code doesn't favor any limit except o, v ≫ 1.

Page 34

Programming Models

Page 35

Categorizing parallel models

MPI-1: Private address space, explicit communication, independent execution.

MPI-RMA: Private-ish address space, explicit communication, independent execution.

OpenMP (loops): Shared address space, implicit communication, collective execution.

OpenMP (sections): Shared address space, implicit communication, semi-independent execution.

Pthreads: Shared address space, independent execution.

GA: Global view of data (not coherent), explicit communication, independent execution.

PGAS: Shared address space, implicit/explicit communication, independent execution.

OpenMP is always implemented on top of Pthreads. PGAS can use MPI and/or Pthreads.

Page 36

Categorizing models

Execution model: independent/tasks vs. cooperative/data-parallel

Memory sharing: private by default vs. public by default

Threading: parallel by default vs. serial by default

Page 37

Decision making process

MPI exists in a constructive cycle of ubiquity and optimization.

OpenMP is reaching ubiquity; optimization is debatable.

Pthreads is ubiquitous but invisible (system programmers only).

PGAS portability, especially performance portability, is still emerging.

Page 38

OpenMP: from 1 to N or N to 1?

Page 39

OpenMP: from 1 to N or N to 1?

#pragma omp parallel
{ /* thread-safe */ }
#pragma omp single
{ /* thread-unsafe */ }
#pragma omp parallel for
{ /* threaded loops */ }
#pragma omp sections
{ /* threaded work */ }

versus:

{ /* thread-unsafe */ }
#pragma omp parallel for
{ /* threaded loops */ }
{ /* thread-unsafe */ }
#pragma omp parallel for
{ /* threaded loops */ }
/* thread-unsafe work */
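As a rough analogy for the two structures (in Python threads, since Python has no OpenMP; every name here is illustrative only): "1 to N" forks workers just around each parallel loop, while "N to 1" keeps workers running and funnels the thread-unsafe part through a lock, playing the role of #pragma omp single.

```python
import threading

data = list(range(8))
out = [0] * len(data)

# "1 to N": serial by default, fork workers only around the parallel loop.
def fork_join_square():
    def work(i):
        out[i] = data[i] ** 2          # thread-safe: disjoint elements
    ts = [threading.Thread(target=work, args=(i,)) for i in range(len(data))]
    for t in ts: t.start()
    for t in ts: t.join()              # join = implicit barrier at loop end

# "N to 1": parallel by default; serialize the thread-unsafe part with a lock.
log = []
log_lock = threading.Lock()
def persistent_worker(i):
    out[i] = data[i] ** 2              # parallel part
    with log_lock:                     # serialized, thread-unsafe part
        log.append(i)

fork_join_square()
assert out == [i * i for i in range(8)]

ts = [threading.Thread(target=persistent_worker, args=(i,)) for i in range(8)]
for t in ts: t.start()
for t in ts: t.join()
assert sorted(log) == list(range(8))
print("both structures produce the same result")
```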

Page 40

Additional thoughts

Pthreads very useful for driving communication

OpenMP NUMA catastrophe (on Linux)

MPI+SHM vs. Pthreads (shared vs. private question)

Learn from the linear algebra community: DAG-scheduling with Pthreads, not OpenMP

– SK+KK TCE task pools in EOMCCSD, MRCC
– our GPU-CC code using task parallelism throughout

Page 41

Beyond one node

NWChem, MADNESS, MPQC, GAMESS (?) burn core(s) for communication

Multi-socket, multi-GPU, multi-rail all gets nasty

Pthread support for affinity becomes more critical

NWChem data-server uses ∼ 25% of cores with Infiniband

Do your GPUs need communication helper threads?

Page 42

Pragmas

Page 43

PGI pragmas

Q: Can I call a CUDA kernel function from my PGI-compiled code?

A: PGI is working on the design of a feature to allow you to call kernel functions written in CUDA or PTX or other languages directly from your C or Fortran program. We will announce this feature when it is available.

Simple tensor contractions are done with DGEMM; non-simple ones are reduced to simple ones by redistribution (permutation).

CUBLAS DGEMM with CUDA permutation codes

Pragma DGEMM with pragma permutation codes

Pragma code only works with Fortran (C is a terrible language to optimize).

Page 44

PGI pragmas

#pragma acc region
{
    for (i = 0; i < num; i++)
        for (j = 0; j < num; j++)
            for (k = 0; k < num; k++)
                c[i*1000+j] += ( a[i*1000+k] * b[k*1000+j] );
}

Page 45

PGI pragmas

42, Accelerator region ignored
44, Accelerator restriction: size of the GPU copy of an array
    depends on values computed in this loop
45, Accelerator restriction: size of the GPU copy of 'c' is unknown
    Accelerator restriction: size of the GPU copy of an array
    depends on values computed in this loop
    Accelerator restriction: one or more arrays have unknown size
46, Accelerator restriction: size of the GPU copy of 'a' is unknown
    Accelerator restriction: size of the GPU copy of 'b' is unknown
    Accelerator restriction: one or more arrays have unknown size

Page 46

PGI: loop-based matrix multiplication

[Plot: matrix multiplication with pragmas — performance (gigaflop/s) vs. rank for the GPU KIJ loop ordering; y-axis runs to ~18 GF.]

GotoBLAS on my Core2 laptop does 20 GF.

Page 47

Pragmas

Usage:

Must allow native function calls

Pragmas to generate native object rather than offload

Explicit data movement

Explicit data movement is something scientists can do!

Things compilers cannot do well:

Optimize for cache (Netlib vs. Goto)

Generate network communication (PGAS)

Move data across PCI bus (GPU pragmas)

Prove me wrong. . .

Page 48

Acknowledgments

Computational Fellowship (AED)
ANL-MCS Breadboard cluster
ANL-LCRC Fusion cluster

Dirac cluster

Page 49

Shameless promotion

Page 50

SAAHPC 2012

What: Symposium on Application Accelerators in HPC
When: July 10–12, 2012
Where: Argonne (Chicago, IL)

Page 51

TCE on GPUs

Page 52

NWChem TCE module

Automatically generated code: easy to rewrite (in principle).

Data-parallel over tiles using Global Arrays.

Naive dynamic load-balancing (shared counter).

Standard GA programming model: check out, compute, check in

Page 53

TCE CPU algorithm

for (P3,P4,H1,H2) in all (P,P,H,H) tiles:
    if (my turn) and (nonzero symmetry):
        allocate and zero buffer Jc
        for (P5,P6) in all (P,P) tiles:
            if (nonzero symmetry):
                allocate and zero buffer Tc
                get Tb from global T
                reorder Tc
                allocate and zero buffer Ic
                get Ib from global I
                reorder Ic
                compute Jc += Tc*Ic
        reorder Jc
        acc Jc onto global J
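The "(my turn)" test is TCE's naive dynamic load balancing: a shared NXTVAL counter hands out the next global tile index, and whichever process checks out index t computes that tile. A minimal sketch in Python threads (names and the stand-in "work" are made up; the real code uses a Global Arrays counter across MPI processes):

```python
import threading

# NXTVAL-style shared counter: each worker checks out the next task index.
class Nxtval:
    def __init__(self):
        self._lock = threading.Lock()
        self._next = 0
    def next(self):
        with self._lock:
            t = self._next
            self._next += 1
            return t

ntiles = 100
results = [0.0] * ntiles
counter = Nxtval()

def worker():
    while True:
        t = counter.next()          # the "(my turn)" test from the slide
        if t >= ntiles:
            return
        results[t] = float(t * t)   # stand-in for get/reorder/GEMM/acc

threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads: th.start()
for th in threads: th.join()

# Every tile was computed exactly once, by whichever worker got there first.
assert results == [float(t * t) for t in range(ntiles)]
print("all tiles computed")
```

Faster workers naturally take more tiles, which is why this scheme load-balances without any static partitioning.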

Page 54

TCE GPU algorithm

grab from pool and zero buffer pair (Tc,Tg)
grab from pool and zero buffer pair (Ic,Ig)
if (push + gpucompute + pull < cpucompute):
    iget and push Tg from global T
    iget and push Ig from global I
    igpu reorder Tg
    igpu reorder Ig
    gpucompute Jg += Tg*Ig
else:
    iget Tc from global T
    iget Ic from global I
    compute Jc += Tc*Ic
ireorder Jg
reorder Jc
pull and acc Jc += Jg
acc Jc onto global J

Page 55

Evaluating the GPU algorithm

Tilesizes not big enough to justify GPU all the time (bad).

Jg stays on the GPU through each pass (good).

Possibly good overlap of data movement (good).

Can unroll inner loop for maximum overlap (good), but this doubles the memory required (bad).

Threaded CPU code helps memory issues but GA/ARMCI is not thread-safe (funneled or serialized should be okay).

Many of the GPU-oriented optimizations will help with multicore CPUs. A fully rewritten GPU TCE in NWChem will probably coincide with porting to BGQ for this reason.

Page 56

Partial bibliography

Page 57

MD on GPUs (of many more)

All major MD packages have or will soon have a GPU implementation.

J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco,and K. Schulten. “Accelerating molecular modeling applications withgraphics processors.” J. Comp. Chem., 28 (16), 2618–2640, 2007.

J. A. Anderson, C. D. Lorenz, and A. Travesset. “General purposemolecular dynamics simulations fully implemented on graphics processingunits.” J. Comp. Phys., 227 (10), 5342–5359, 2008.

P. Friedrichs, M. S.and Eastman, V. Vaidyanathan, M. Houston,S. Legrand, A. L. Beberg, D. L. Ensign, C. M. Bruns, and V. S. Pande.“Accelerating molecular dynamic simulation on graphics processingunits.” J. Comp. Chem., 30 (6), 864–872, 2009.

R. Yokota, T. Hamada, J. P. Bardhan, M. G. Knepley, and L. A. Barba. “Biomolecular electrostatics simulation by an FMM-based BEM on 512 GPUs.” CoRR, abs/1007.4591, 2010.


Page 58

DFT on GPUs

K. Yasuda. “Accelerating density functional calculations with graphics processing unit.” J. Chem. Theo. Comp., 4 (8), 1230–1236, 2008.

L. Genovese, M. Ospici, T. Deutsch, J.-F. Mehaut, A. Neelov, and S. Goedecker. “Density functional theory calculation on many-cores hybrid central processing unit-graphic processing unit architectures.” J. Chem. Phys., 131 (3), 034103, 2009.

I. S. Ufimtsev and T. J. Martínez. “Quantum chemistry on graphical processing units. 2. Direct self-consistent-field (SCF) implementation.” J. Chem. Theo. Comp., 5 (4), 1004–1015, 2009; “Quantum chemistry on graphical processing units. 3. Analytical energy gradients, geometry optimization, and first principles molecular dynamics.” J. Chem. Theo. Comp., 5 (10), 2619–2628, 2009.

C. J. Woods, P. Brown, and F. R. Manby. “Multicore parallelization of Kohn-Sham theory.” J. Chem. Theo. Comp., 5 (7), 1776–1784, 2009.

P. Brown, C. J. Woods, S. McIntosh-Smith, and F. R. Manby. “A massively multicore parallelization of the Kohn-Sham energy gradients.” J. Comp. Chem., 31 (10), 2008–2013, 2010.

R. Farmber, E. Bylaska, S. Baden, and students. “(Car-Parrinello on GPGPUs).” Work in progress, 2011.


Page 59

MP2 on GPUs

MP2 using the resolution-of-the-identity (RI) approximation involves only a few large GEMMs.
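The claim above can be sketched in a few lines. With three-index RI factors B, the integral build (ia|jb) = Σ_Q B[Q,ia] B[Q,jb] is one large GEMM, and the MP2 energy is a reduction over its output. All dimensions and data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
no, nv, naux = 4, 6, 20    # occupied, virtual, auxiliary dimensions (made up)

# Three-index RI factors B[Q, ia]; random data stands in for real integrals.
B = 0.01 * rng.standard_normal((naux, no * nv))

# The "few large GEMMs": (ia|jb) = sum_Q B[Q, ia] * B[Q, jb]
iajb = (B.T @ B).reshape(no, nv, no, nv)

# Synthetic but physically-signed orbital energies: e_occ < 0 < e_virt.
eo = -1.0 - rng.random(no)
ev = 1.0 + rng.random(nv)
denom = (eo[:, None, None, None] - ev[None, :, None, None]
         + eo[None, None, :, None] - ev[None, None, None, :])

# The energy is a reduction over the GEMM output:
# E_MP2 = sum (ia|jb) [2 (ia|jb) - (ib|ja)] / (e_i - e_a + e_j - e_b)
e_mp2 = np.sum(iajb * (2.0 * iajb - iajb.transpose(0, 3, 2, 1)) / denom)
```

Because the hot spot is a single (ov × naux) by (naux × ov) product, RI-MP2 maps naturally onto a vendor GEMM with little restructuring.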

L. Vogt, R. Olivares-Amaya, S. Kermes, Y. Shao, C. Amador-Bedolla, and A. Aspuru-Guzik. “Accelerating resolution-of-the-identity second-order Møller-Plesset quantum chemistry calculations with graphical processing units.” J. Phys. Chem. A, 112 (10), 2049–2057, 2008.

R. Olivares-Amaya, M. A. Watson, R. G. Edgar, L. Vogt, Y. Shao, and A. Aspuru-Guzik. “Accelerating correlated quantum chemistry calculations using graphical processing units and a mixed precision matrix multiplication library.” J. Chem. Theo. Comp., 6 (1), 135–144, 2010.

A. Koniges, R. Preissl, J. Kim, D. Eder, A. Fisher, N. Masters, V. Mlaker, S. Ethier, W. Wang, and M. Head-Gordon. “Application acceleration on current and future Cray platforms.” In CUG 2010, Edinburgh, Scotland, May 2010.


Page 60

QMC on GPUs

QMC is ridiculously parallel at the node level; hence, implementing the single-node kernel implies the potential for scaling to many thousands of GPUs.
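A toy sketch of why this is so: walkers are statistically independent, so each device only needs to return a partial sum and a count, and combining results is trivial. The "local energy" samples below are synthetic:

```python
import random

def run_walkers(n_walkers, seed):
    """Toy 'local energy' samples from independent walkers.
    Gaussian noise around -1.0 stands in for real QMC estimates."""
    rng = random.Random(seed)
    return [rng.gauss(-1.0, 0.1) for _ in range(n_walkers)]

# Partition walkers across (pretend) devices; each runs independently,
# with no communication until the final average.
shards = [run_walkers(1000, seed) for seed in range(4)]

# Combining only requires a sum and a count from each device.
total = sum(sum(s) for s in shards)
count = sum(len(s) for s in shards)
mean_energy = total / count
```

The combined average is exactly the average over all walkers, so nothing in the algorithm itself limits the device count.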

A. G. Anderson, W. A. Goddard III, and P. Schroder. “Quantum Monte Carlo on graphical processing units.” Comp. Phys. Comm., 177 (3), 298–306, 2007.

A. Gothandaraman, G. D. Peterson, G. Warren, R. J. Hinde, and R. J. Harrison. “FPGA acceleration of a quantum Monte Carlo application.” Par. Comp., 34 (4-5), 278–291, 2008.

K. Esler, J. Kim, L. Shulenburger, and D. Ceperley. “Fully accelerating quantum Monte Carlo simulations of real materials on GPU clusters.” Comp. Sci. Eng., 99 (PrePrints), 2010.


Page 61

CC on GPUs

Since we started this project, two groups have implemented non-iterative triples corrections to CCSD on GPUs. These procedures involve a few very large GEMMs and a reduction. At least two other groups are working on CC on GPUs but have not yet reported results.
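The computational pattern (not the actual (T) working equations, which sum several such contractions over permutations) can be sketched with synthetic tensors: one large flattened GEMM builds an intermediate, and a reduction against energy denominators yields a scalar:

```python
import numpy as np

rng = np.random.default_rng(1)
nv, no = 8, 4    # virtual and occupied dimensions (made up)

# Synthetic doubles amplitudes and integrals, flattened so the
# contraction is a single large GEMM, as described on the slide.
t2 = 0.01 * rng.standard_normal((no * no, nv * nv))   # t[ij, ab]
v = rng.standard_normal((nv * nv, nv))                # v[ab, c]

# One large GEMM: W[ij, c] = sum_ab t[ij, ab] * v[ab, c]
W = t2 @ v

# Reduction: an energy-like scalar from W against a (synthetic,
# negative) denominator, mimicking the perturbative correction.
D = -(1.0 + rng.random(W.shape))
e_t = np.sum(W * W / D)
```

As with RI-MP2, the flop count is dominated by the GEMM, which is why the non-iterative triples correction was an early target for GPU acceleration.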

M. Melicherčík, L. Demovič, P. Neogrády, and M. Pitoňák. “Acceleration of CCSD(T) computations using technology of graphical processing unit.” 2010.

W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski. “GPU-based implementations of the regularized CCSD(T) method: applications to strongly correlated systems.” Submitted to J. Chem. Theo. Comp., 2010.

A. E. DePrince and J. R. Hammond. “Coupled-cluster theory on graphics processing units I. The coupled-cluster doubles method.” Submitted to J. Chem. Theo. Comp., 2010.


Page 62

Gaussian Integrals on GPUs

Quantum chemistry methods spend a lot of time generating matrix elements of operators in a Gaussian basis set. All published implementations are either closed-source (commercial) or the source is unpublished; otherwise we would be using them in our code.

I. S. Ufimtsev and T. J. Martínez. “Quantum chemistry on graphical processing units. 1. Strategies for two-electron integral evaluation.” J. Chem. Theo. Comp., 4 (2), 222–231, 2008.

A. V. Titov, V. V. Kindratenko, I. S. Ufimtsev, and T. J. Martínez. “Generation of kernels for calculating electron repulsion integrals of high angular momentum functions on GPUs – preliminary results.” In Proc. SAAHPC 2010, pages 1–3, 2010.

A. Asadchev, V. Allada, J. Felder, B. M. Bode, M. S. Gordon, and T. L. Windus. “Uncontracted Rys quadrature implementation of up to G functions on graphical processing units.” J. Chem. Theo. Comp., 6 (3), 696–704, 2010.
