

Enabling GAMESS for Extreme Computing in Chemistry and Materials Giuseppe Barca3, Colleen Bertoni2, Dmytro Bykov7, Laura Carrington4, Dipayan Datta1,6, Jorge Galvez1,6, Anastasia Guinina1, Taylor Harville1,6, Erik Jensen7, Sarom Leang4, Shirley Moore7, Buu Pham1,6, David Poole1,6,

Alistair Rendell3, Tosaporn Sattasathuchana1, David Sherrill5, Masha Sosonkina8, Vaibhav Sundriyal8, Ananta Tiwari4, Bryce Westheimer1,6, Theresa Windus1,6, Peng Xu1,6, Federico Zahariev1, Mark Gordon*1,6

1Ames Laboratory, 2Argonne National Laboratory, 3Australian National University, 4EP Analytics, 5Georgia Tech University, 6Iowa State University, 7Oak Ridge National Laboratory, 8Old Dominion University

(H2O)2615

Communication Overhead

EFMO: GPU Acceleration Paths

How do we solve this problem? With quantum chemistry!

RI-MP2 OpenMP GPU Offloading

EFMO Workflow

FRAGMENTATION: Many Body Expansion

LibAccInt GPU Scaling

Fragment Schemes

The EFMO algorithm can be split into two stages:
1. STAGE 1 -> computation associated with fragments
2. STAGE 2 -> computation associated with fragment dimers
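For context, the two stages map onto the generic two-body (many-body) expansion used by FMO-type fragmentation methods. A sketch of that textbook form (not the exact EFMO energy expression, which also includes EFP-based terms for well-separated dimers):

$$
E \;\approx\; \sum_{I} E_{I} \;+\; \sum_{I>J} \bigl( E_{IJ} - E_{I} - E_{J} \bigr),
$$

where the fragment (monomer) energies $E_I$ come from STAGE 1 and the fragment-dimer energies $E_{IJ}$ come from STAGE 2.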

Legend: *Currently on GPU; In progress on GPU; Planned on GPU

1-electron integrals
• ~N^2/2
• N = 10^4: ~10^8 1-electron integrals

2-electron integrals*
• ~N^4/8
• N = 10^4: ~10^16 2-electron integrals
• Cannot store in memory
• Recalculate on-the-fly or disk storage
• Different algorithms depending on l quantum numbers
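As a worked version of the ~N^4/8 count (the standard permutational-symmetry argument, not anything specific to GAMESS): with $M = N(N+1)/2$ unique index pairs, the number of symmetry-unique two-electron integrals is

$$
N_{\mathrm{ERI}} \;=\; \frac{M(M+1)}{2} \;=\; \frac{1}{2}\,\frac{N(N+1)}{2}\!\left(\frac{N(N+1)}{2}+1\right) \;\approx\; \frac{N^{4}}{8},
$$

which is why they cannot all be kept in memory for large N and must be recomputed on the fly or held on disk.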

Hartree-Fock*
• Iterative solution
• Populate F matrices using integrals

MP2*
• Requires partial 4-label transformation (4-LT)
• Scales ~N^5

RI-MP2*
• Reduces number of integrals in 4-LT
• Scaling can be reduced to ~N^4

HL: CCSD(T), CR-CC(2,3)
• Requires full 4-LT
• Scales ~N^7

• RI can reduce both scaling and memory footprint
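The RI step referred to above replaces four-center integrals with three-center quantities. In its standard (textbook) density-fitting form, stated here as the general idea rather than a description of the specific GAMESS kernels,

$$
(ia|jb) \;\approx\; \sum_{P,Q} (ia|P)\,\bigl[\mathbf{J}^{-1}\bigr]_{PQ}\,(Q|jb),
\qquad J_{PQ} = (P|Q),
$$

so only the $O(N^{2} N_{\mathrm{aux}})$ three-center integrals $(ia|P)$ over the auxiliary basis need to be formed and stored, which is what reduces both the 4-LT scaling and the memory footprint.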

Required GPU scaling

➢ In both the base and stretch challenge problems, we assign a single fragment dimer to a single Aurora/Frontier GPU

➢ Each of these GPUs is 5x more powerful than one Summit GPU (NVIDIA V100)

➢ In order to run the dimer calculations efficiently, we need to scale up to at least 5 GPUs on Summit, each giving a ~7x speedup over a CPU socket
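Reading these three bullets together (our interpretation of the arithmetic, not an explicit statement on the poster): one Aurora/Frontier GPU is taken as ~5x a V100, and each V100 as ~7x a CPU socket, so demonstrating "one dimer per exascale GPU" on Summit amounts to showing

$$
5 \times (\sim\!7\times\ \text{per socket}) \;\approx\; 35\times\ \text{one CPU socket} \;\approx\; 1\ \text{Aurora/Frontier GPU per dimer}.
$$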

Base Challenge Problem

➢ 1 MSN ring, fully solvated -> 61,776 total fragment dimers, of which 21,000 are treated at the quantum (RI-MP2) level

➢ All MSN fragment dimers and most MSN-water dimers are treated at the quantum level
  ■ All the water inside - but not outside - the pore is quantum
  ■ Ensures accurate chemistry and physics

➢ 20 energy + gradient points to ensure sufficiently accurate potential energy surface model

➢ The calculation will use 21,000 GPUs, about 75% of Aurora/Frontier, for 7.5-8 h
  ■ Allow for up to 5% communication overhead

STAGE 1
➢ N GDDI groups are formed, each performing HF, MakeEFP, and RI-MP2 for a different fragment

[EFMO workflow diagram]
Stage 1 - Group 1, Group 2, ..., Group N: HF, MakeEFP, & RI-MP2 on fragment 1, fragment 2, ..., fragment N
GDDI groups redistribution
Stage 2 - Group 1, Group 2, ..., Group N(N-1)/2: HF and RI-MP2 on dimer 1, dimer 2, ..., dimer N(N-1)/2
Gather energy

Stretch Challenge Problem

➢ 4 MSN rings, fully solvated -> 990,528 total fragment dimers, of which 28,000 are treated at the quantum (RI-MP2) level

➢ All MSN fragment dimers and a sufficient number of MSN-water dimers are treated at the quantum level
  ■ >80% of the water in the pore is quantum
  ■ Ensures sufficiently accurate chemistry

➢ 40 energy + gradient points to ensure extremely accurate potential energy surface model

➢ The calculation will use 28,000 GPUs, that is, 100% of Aurora/Frontier, for 15-16 h
  ■ Allow for up to 5% communication overhead


STAGE 2
➢ MPI ranks in GDDI groups are redistributed so that each group can deal with a dimer

➢ Each GDDI group performs HF and RI-MP2 for a different fragment dimer

➢ Once the calculations on the dimers are completed, the energy is gathered from the GDDI groups
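A minimal sketch of the two-stage group redistribution described above, written with plain MPI communicator splitting. This is illustrative only and is not the GAMESS/GDDI implementation; the fragment count n_frag, the round-robin coloring, and the placeholder energies are hypothetical.

```cpp
// Sketch: split MPI ranks into fragment groups (STAGE 1), then re-split the
// same ranks into dimer groups (STAGE 2), then gather a (placeholder) energy.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // STAGE 1: one group per fragment (hypothetical count)
    const int n_frag = 4;
    MPI_Comm frag_comm;
    MPI_Comm_split(MPI_COMM_WORLD, rank % n_frag, rank, &frag_comm);
    // ... each frag_comm would run HF, MakeEFP, and RI-MP2 for its fragment ...
    MPI_Comm_free(&frag_comm);

    // STAGE 2: redistribute the same ranks into one group per fragment dimer
    const int n_dimer = n_frag * (n_frag - 1) / 2;
    MPI_Comm dimer_comm;
    MPI_Comm_split(MPI_COMM_WORLD, rank % n_dimer, rank, &dimer_comm);
    // ... each dimer_comm would run HF and RI-MP2 for its dimer ...
    MPI_Comm_free(&dimer_comm);

    // Gather the energy contributions from all groups (placeholder values here)
    double e_local = 0.0, e_total = 0.0;
    MPI_Reduce(&e_local, &e_total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("total energy (placeholder): %f\n", e_total);

    MPI_Finalize();
    return 0;
}
```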

                                                  Wall time (s)   Speedup
Serial w/ 1 core of P9                                  342.697      0.04
OpenMP + ESSL dgemm w/ 42 threads on 2 P9                12.231      1.00
OpenMP offloading + nvblas dgemm on 1 V100                1.734      7.05
OpenMP offloading + cublas dgemm on 1 V100                1.983      6.17
OpenMP offloading + cublasXt dgemm on 1 V100              1.728      7.08
OpenACC offloading + cublas dgemm on 1 V100               1.905      6.42
OpenACC offloading + cublasXt dgemm on 1 V100             1.692      7.23
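The best-performing rows in the table use cublasXt, which accepts host-resident matrices and tiles them to the GPU automatically. A minimal standalone sketch of that call pattern is below; the matrix size and device choice are assumptions for illustration, and this is not the GAMESS RI-MP2 driver itself.

```cpp
// Sketch: DGEMM on host-allocated, column-major matrices via cublasXt,
// which handles host<->device tiling internally. Link against cuBLAS.
#include <cublasXt.h>
#include <vector>
#include <cstdio>

int main() {
    const size_t n = 2048;                         // hypothetical matrix dimension
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

    cublasXtHandle_t handle;
    if (cublasXtCreate(&handle) != CUBLAS_STATUS_SUCCESS) return 1;

    int devices[1] = {0};                          // use GPU 0 (a single V100 in the table)
    cublasXtDeviceSelect(handle, 1, devices);

    const double alpha = 1.0, beta = 0.0;
    // C = alpha*A*B + beta*C; all pointers are plain host pointers
    cublasXtDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n,
                  &alpha, A.data(), n,
                          B.data(), n,
                  &beta,  C.data(), n);

    std::printf("C[0] = %.1f (expected %zu)\n", C[0], n);
    cublasXtDestroy(handle);
    return 0;
}
```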

➢ (H2O)2615 split into 217 fragments
➢ Basis set used: 6-31G(d)//cc-pVDZ-RI
➢ 256-768 KNL nodes, split into 256-768 groups (1 node/group)
➢ Each node creates 1 rank + a team of 64 threads

➢ Wall time (s) vs. number of KNL compute nodes

Each GDDI group has three possible paths to accelerate HF, MakeEFP, and RI-MP2

RI-MP2/FMO OpenMP Accelerator Offloading

Average Communication % of EFMO/RHF

# Atoms   # Frags   # Nodes   % of Total Run-Time
    304         6         1                  4.22
    592        11         2                  4.26
   1141        22         4                  4.54
   1738        32         8                  4.96

➢ Examine four increasing problem sizes using weak scaling (i.e., the number of atoms/node is approximately constant)
➢ Scale from 1-32 nodes and 304-1738 atoms
➢ Percent communication remains relatively constant at ~5%

LibCChem Generalized Fock GPU Scaling

➢ Run on Summit on a system of 150 H2O molecules.

➢ 1950 basis functions

➢ Used from 1 up to 9 GPUs
➢ 1 GPU per MPI rank

➢ The scaling with respect to number of GPUs is excellent (96.9% parallel efficiency).

LibCChem RI-MP2 GPU Scaling

➢ Run on Summit on a system of 150 H2O molecules

➢ 1950 basis functions

➢ Used from 1 up to 9 GPUs
➢ 1 GPU per MPI rank

➢ The scaling with respect to number of GPUs is nearly ideal (98.1% parallel efficiency)
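For reference, the parallel-efficiency numbers quoted in these two panels presumably follow the usual definition (our assumption of the convention; the poster does not spell it out):

$$
E(p) \;=\; \frac{T(1)}{\,p\,T(p)\,} \times 100\%,
$$

where $T(p)$ is the wall time on $p$ GPUs, so 96.9% and 98.1% at $p = 9$ mean the 9-GPU runs land within a few percent of the ideal 9x speedup.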

➢ We ran our RI-MP2 on Summit on a system of 150 H2O molecules

➢ 3750 primary and 12600 auxiliary basis functions

➢ Calculations were performed on 11 nodes
➢ 11 MPI processes, each with 84 threads (42 P9 cores at SMT 2) and a variable number of GPUs

[Figure annotations: 20 Å, 80 Å]
➢ Ratio of wall time to the wall time of the (H2O)2615 calculation using 16,384 threads