


CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 0000; 00:1-27
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe

Design and performance characterization of electronic structure calculations on massively parallel supercomputers: A case study of GPAW on the Blue Gene/P architecture

N. A. Romero,1,* C. Glinsvad,2 A. H. Larsen,3,4 J. Enkovaara,5,6 S. Shende,7 V. A. Morozov,1 and J. J. Mortensen3

1 Leadership Computing Facility, Argonne National Laboratory, Argonne, IL 60439, USA
2 Center for Individual Nanoparticle Functionality, Department of Physics, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
3 Center for Atomic-scale Materials Design, Department of Physics, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
4 Nano-bio Spectroscopy Group and ETSF Scientific Development Centre, Departamento de Física de Materiales, Universidad del País Vasco, CSIC-UPV/EHU-MPC and DIPC, Avenida de Tolosa 72, E-20018 San Sebastián, Spain
5 CSC - IT Center for Science Ltd., P.O. Box 405, FI-02101 Espoo, Finland
6 Department of Applied Physics, School of Science, Aalto University, P.O. Box 11100, FI-00076 AALTO, Finland
7 Performance Research Laboratory, University of Oregon, Eugene, OR 97403, USA

SUMMARY

Density functional theory (DFT) is the most widely employed electronic structure method due to its favorable scaling with system size and accuracy for a broad range of molecular and condensed-phase systems. The advent of massively parallel supercomputers has enhanced the scientific community's ability to study larger system sizes. Ground-state DFT calculations on ~10^3 valence electrons using traditional O(N^3) algorithms can be routinely performed on present-day supercomputers. The performance characteristics of these massively parallel DFT codes on >10^4 computer cores are not well understood. The GPAW code was ported and optimized for the Blue Gene/P architecture. We present our algorithmic parallelization strategy and interpret the results for a number of benchmark test cases.


KEY WORDS: GPAW, electronic structure, DFT, Blue Gene, massive parallelization, high-performance computing

1. INTRODUCTION

Kohn–Sham density functional theory (DFT) [1, 2] continues to play an important role in the computational modeling of molecules and condensed-phase systems. Its popularity is due to a number of factors, such as accuracy over a wide range of systems (e.g., insulators, metals, molecules), its first-principles nature, and the ability to treat relevant system sizes on commodity Linux clusters. The performance of DFT software packages is not well understood by the community at large. In an era of rapidly evolving high-performance computing (HPC) architectures, the design and performance characteristics of massively parallel DFT codes are of great importance because they are a means to enabling high-impact science.

∗Correspondence to: [email protected]



Typical DFT calculations include geometry optimization and Born–Oppenheimer molecular dynamics (BOMD). These types of calculations require ~10^3-10^6 electronic minimization steps and thus large amounts of computer time. Even with multiple levels of parallelization and large supercomputers, the time-to-solution† for these calculations can be on the order of several days. Strong-scaling performance bottlenecks are present over a wide range of system sizes relevant to users due to serial parts (or parts with limited parallelization) in traditional O(N^3) DFT algorithms, a behavior known as Amdahl's Law [3].

GPAW [4, 5] is a real-space finite-difference (FD) [6, 7] DFT code based on the projector augmented-wave (PAW) [8] method. In addition to standard ground-state DFT, extensions such as time-dependent DFT [9, 10], the GW approximation [11], and the Bethe–Salpeter equation [12] are implemented in GPAW. It is also possible to use localized atomic orbitals (LCAOs) [13, 14], which are complementary to real-space grids, as well as plane wave basis sets. In this work, we focus on the parallelization of real-space ground-state DFT calculations.

GPAW is written in the Python and C programming languages [15] and uses the MPI [16, 17] programming model for parallel execution. Originally developed on commodity Linux clusters, GPAW has now been successfully ported for use on Blue Gene/P [18] (and other massively parallel architectures such as the Cray XT series) on >10^5 computational cores. From the beginning, GPAW has been designed for large-scale scalability, and it currently supports multiple levels of parallelization. Here, we investigate the performance of the two main non-trivial levels of parallelization: real-space domain decomposition and band parallelization. We note that this two-level parallelization approach is also implemented in at least one other real-space DFT code, JuRS [19], as well as in a number of plane wave based DFT codes including ABINIT [20], CPMD [21, 22], Qbox [23], Quantum ESPRESSO [24], NWChem [25], PEtot [26], and VASP [27].

Section 2 describes the algorithm for minimizing the Kohn–Sham energy functional, and Section 3 is an in-depth description of the parallelization strategy employed. Performance bottlenecks based on Amdahl's Law are derived by simple analytical considerations. Computational results for various system sizes are presented and analyzed in Section 4. Finally, a summary and an outlook are provided in Section 5.

2. MINIMIZING THE KOHN–SHAM ENERGY FUNCTIONAL

There are two general families of methods for calculating the Kohn–Sham (KS) ground state: (i) direct minimization methods and (ii) iterative methods where the KS Hamiltonian is diagonalized in conjunction with density mixing. Iterative diagonalization methods contain two levels of iteration, one for the eigenvalue problem and another for the density, and they are often referred to as self-consistent field (SCF) methods [28]. GPAW employs SCF methods to obtain the KS ground state and offers three different iterative eigensolvers: (i) the residual minimization method by direct inversion in the iterative subspace (RMM-DIIS) [29, 28], (ii) the conjugate gradient method [30, 31], and (iii) Davidson's method [32, 28]. The RMM-DIIS method is the main iterative eigensolver, has been optimized for maximum scalability on high-performance computing platforms, and is thus our main focus in this work. A detailed description of the PAW method as implemented in the GPAW code is found in Reference [5]; only the salient features are presented here.

Within the PAW approximation, the SCF method for minimizing the KS energy functional leads to the generalized eigenvalue equation for the pseudo (PS) wave functions ψn,

\[ H \psi_{skn} = \varepsilon_{skn} S \psi_{skn}, \tag{1} \]

where s is the spin index, k the k-point index, and n the band index. We limit our discussion mostly to spin-unpolarized systems with a single k-point (Γ-point sampling of the Brillouin zone), and thus the s and k indices are omitted. The Hamiltonian operator H is nonlinear as it depends on the wave functions. Here S is the overlap operator, which can be defined in terms of the PAW transformation

†The total wall-clock time from beginning to end of the calculation.



operator T such that

\[ S = T^{\dagger} T = 1 + \sum_{a} \sum_{i_1 i_2} | p^{a}_{i_1} \rangle \, \Delta S^{a}_{i_1 i_2} \, \langle p^{a}_{i_2} |. \tag{2} \]

Here a is an atomic index and i1, i2 are combination indices for the principal, angular momentum, and magnetic quantum numbers. The projector functions p^a_i are localized to an augmentation sphere, and the ΔS^a_{i1 i2} are the atomic corrections to the overlap operator.

The Hamiltonian operator in Equation 1 is given by

\[ H = -\tfrac{1}{2} \nabla^{2} + v + \sum_{a} \sum_{i_1 i_2} | p^{a}_{i_1} \rangle \, \Delta H^{a}_{i_1 i_2} \, \langle p^{a}_{i_2} |, \tag{3} \]

where ΔH^a_{i1 i2} represents the atomic corrections and the effective potential is

\[ v = v_{\mathrm{coul}} + v_{\mathrm{xc}} + \sum_{a} v^{a}. \tag{4} \]

The Coulomb potential v_coul satisfies the Poisson equation for the charge density (including the compensation charge contribution), ∇^2 v_coul = −4πρ, while v_xc is the exchange–correlation (XC) potential and v^a represents fixed local potentials around each atom.

The SCF solution of Equation 1 using the RMM-DIIS algorithm proceeds as follows:

1. Initial guess for the PS wave functions ψn given by the LCAO initialization.
2. Orthogonalization of wave functions:
   (a) Construction of the overlap matrix: Smn = ⟨ψm|S|ψn⟩.
   (b) Cholesky factorization of the overlap matrix: S = L^T L.
   (c) Rotation of wave functions: L^{-1} ψn → ψn.
3. Calculation of the PS density: n(r) = Σ_n fn |ψn(r)|^2 + Σ_a n_c^a(r), where fn represents the occupation numbers and n_c^a is a smooth PS core density equal to the all-electron (AE) core density outside the augmentation sphere.
4. Calculation of the Coulomb and effective potentials: v_coul, v.
5. Application of the KS operator on the PS wave functions: Hψn.
6. Subspace diagonalization:
   (a) Construction of the Hamiltonian matrix: Hmn = ⟨ψm|H|ψn⟩.
   (b) Diagonalization of the Hamiltonian matrix gives the eigenvalues and eigenvectors in the subspace: εn, Wmn.
   (c) Rotation of wave functions: Wψn → ψn.
7. Update of wave functions by residual minimization:
   (a) Calculation of residuals: Rn = (H − εn S)ψn.
   (b) Improvement of the PS wave functions: ψ′n = ψn + λPRn, where the preconditioner P is the approximate inverse of the kinetic energy operator T, P = T^{-1}. The step length λ is chosen to minimize the norm of the residual for the new guess: R′n = (H − εn S)ψ′n.
   (c) Final trial step with the same step length: ψ′n = ψn + λPRn + λPR′n.
   The algorithm is equivalent to RMM-DIIS as described in Ref. [28] with a subspace dimension of two.
8. If the energy, density, and wave functions have not converged, return to Step 2; otherwise, the process is done.
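For orientation, the loop structure of Steps 2-8 can be summarized in a schematic Python sketch. The helper functions (apply_h, apply_s, density, potential, precondition, converged) are hypothetical stand-ins for the corresponding GPAW kernels rather than GPAW's actual API; k-point, spin, and PAW bookkeeping are omitted and real Γ-point wave functions are assumed.

```python
import numpy as np

def scf_rmm_diis(psi, apply_h, apply_s, density, potential, precondition,
                 lam=0.1, maxiter=100, converged=lambda *args: False):
    """Schematic RMM-DIIS SCF loop (Steps 2-8); psi has shape (Nb, Ngrid)."""
    for _ in range(maxiter):
        # Step 2: orthonormalize via Cholesky factorization of the overlap matrix.
        S = psi @ apply_s(psi).T                 # S_mn = <psi_m|S|psi_n>
        L = np.linalg.cholesky(S)                # S = L L^T
        psi = np.linalg.solve(L, psi)            # psi_n <- L^-1 psi_n

        # Steps 3-5: density, potentials, and application of the KS operator.
        n = density(psi)
        v = potential(n)
        hpsi = apply_h(psi, v)

        # Step 6: subspace diagonalization and rotation of the wave functions.
        H = psi @ hpsi.T                         # H_mn = <psi_m|H|psi_n>
        eps, W = np.linalg.eigh(H)
        psi, hpsi = W.T @ psi, W.T @ hpsi

        # Step 7: residual minimization (fixed step length here for brevity;
        # GPAW chooses lambda per band to minimize the residual norm).
        R = hpsi - eps[:, None] * apply_s(psi)
        psi1 = psi + lam * precondition(R)       # Step 7(b)
        R1 = apply_h(psi1, v) - eps[:, None] * apply_s(psi1)
        psi = psi1 + lam * precondition(R1)      # Step 7(c)

        # Step 8: convergence test on energy, density, and wave functions.
        if converged(eps, n, psi):
            break
    return psi, eps
```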

The equations are discretized by representing the wave functions, densities, and other variables with their values at points on a uniform real-space grid; for example, ψn(r) = ψng, where g is a combination index for the grid points in the three Cartesian dimensions. In the discretized form the Hamiltonian can be written as

\[ H_{gg'} = -\tfrac{1}{2} L_{gg'} + v_{g} \delta_{gg'} + \sum_{a} \sum_{i_1 i_2} p^{a}_{i_1 g} \, \Delta H^{a}_{i_1 i_2} \, p^{a}_{i_2 g'}, \tag{5} \]



where L_{gg'} is the finite-difference stencil for the Laplacian. The Hamiltonian is sparse, and because the RMM-DIIS algorithm requires only the evaluation of matrix-vector products, the full matrix is never stored.
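A minimal NumPy sketch of such a matrix-free Hψ product is shown below (second-order FD Laplacian for one band on a periodic grid, diagonal local potential, and sparse PAW corrections). The data layout and helper names are illustrative assumptions rather than GPAW's internal data structures; GPAW uses higher-order stencils and includes the grid volume element in the projector integrals.

```python
import numpy as np

def apply_hamiltonian(psi, v, projectors, dH, h):
    """Matrix-free H|psi> for one real band on a periodic uniform grid.

    psi, v     : 3D arrays on the real-space grid
    projectors : list of (idx, p) pairs per atom -- flattened grid indices
                 inside the augmentation sphere and projector values there,
                 with p of shape (n_projectors, len(idx))
    dH         : list of Delta-H^a_{i1 i2} matrices, one per atom
    h          : grid spacing
    """
    # Kinetic term: -1/2 Laplacian via a nearest-neighbour stencil.
    lap = -6.0 * psi
    for axis in range(3):
        lap += np.roll(psi, +1, axis=axis) + np.roll(psi, -1, axis=axis)
    hpsi = -0.5 * lap / h**2

    # Local effective potential: diagonal in the grid representation.
    hpsi += v * psi

    # PAW corrections: nonzero only inside the augmentation spheres.
    flat_in = psi.ravel()
    flat_out = hpsi.reshape(-1)          # view into hpsi
    for (idx, p), dHa in zip(projectors, dH):
        P = p @ flat_in[idx]             # <p^a_i|psi> (volume element omitted)
        flat_out[idx] += (dHa @ P) @ p   # sum_i1i2 |p^a_i1> dH^a_i1i2 <p^a_i2|psi>
    return hpsi
```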

3. PARALLELIZATION STRATEGY

The parallelization strategy employed in GPAW is co-designed with the memory requirements, floating point operations, and communication patterns of the RMM-DIIS algorithm. These resource requirements are related to three extensive parameters characterizing the size of the calculation: (i) the total number of bands Nb (or states in the case of finite systems),‡ (ii) the total number of grid points Ng, and (iii) the total number of PAW projectors Np, which is related to the total number of atoms Na and their chemical species. For real physical systems, these parameters are inter-related:

• Ng ≫ Nb, which is the mathematical condition motivating iterative diagonalization instead of direct diagonalization.
• Np ≥ Na, since each atom generally has at least one projector function for each valence state.

The goal of the parallelization strategy presented here is to achieve the best weak- and strong-scaling performance. Weak-scaling is primarily concerned with enabling calculations on larger system sizes, which might otherwise not be possible due to hardware constraints (e.g., memory), while strong-scaling is primarily concerned with reducing the time-to-solution for a fixed system size. The performance of a DFT code is determined by a number of aspects that cross-cut hardware and software, as summarized here:

1. Initialization phase (which includes reading the input file and PAW pseudopotentials, as well as the LCAO calculation which generates the initial PS wave functions);
2. Scalable data structures;
3. Amdahl's Law bottlenecks in the main SCF algorithm;
4. Scalable communication patterns;
5. Single-core peak performance of key algorithmic kernels; and
6. Check-pointing (with a frequency determined by the user).

In this paper we focus on Items 2 through 4 from this list in Sections 3.1-3.3. We briefly touch upon the remaining items in Section 3.4.

3.1. Memory and Matrix Distribution

The first hardware resource that becomes exhausted as the system size increases is the memory per node. The matrix requiring the largest amount of memory stores the PS wave functions. The PS wave functions ψng are stored naturally as a four-dimensional matrix (as g is a combination index for the three Cartesian dimensions). The storage requirement for ψng is thus O(NbNg). Beyond a certain system size, it becomes necessary to distribute ψng along the band index n and grid index g simultaneously. Figure 1 illustrates the simplicity behind this approach, which is inspired by well-known concepts in the parallel dense linear algebra community [33, 34, 35] and is now employed in several DFT codes [20, 21, 22, 23, 24, 25, 26, 27]. The PS wave functions ψng are distributed on a B×G process grid, where B and G are the total number of band groups and real-space domains, respectively. Note that G is the product of G1, G2, and G3, the domain decomposition along each of the three axes of the unit cell. Thus in our parallelization strategy, there are a total of B×G MPI tasks, with each task assigned a single contiguous range of rows and columns of ψng (this layout is commonly known as a two-dimensional block layout). For algorithmic simplicity, we require Nb mod B = 0 but allow Ngi mod Gi ≠ 0 for i ∈ {1, 2, 3}, since doing otherwise would lead to changes in the grid spacing as a function of the number of MPI tasks, which is undesirable.

‡Note that Nb is not to be confused with Nval, the total number of valence electrons. In direct minimization methods, Nb is typically equal to Nval/2, which is the number of occupied bands, while iterative diagonalization methods, including RMM-DIIS, require extra unoccupied bands to achieve convergence, hence Nb > Nval/2.

Figure 1. Two-dimensional grid layout of PS wave functions (B band groups by G domains). Each square represents an MPI task and belongs to two MPI communicators: band_comm (blue) and domain_comm (red). Each MPI task is assigned a single contiguous slice of ψng; band_comm contains MPI tasks with different n indices but the same g indices; domain_comm contains MPI tasks with different g indices but the same n indices.
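The layout in Figure 1 maps directly onto two MPI communicator splits. The following is a minimal mpi4py sketch, assuming B divides the number of tasks and that ranks are ordered with the band index varying slowest (as GPAW's world communicator is, see Section 4.3); the variable names are ours, not GPAW's.

```python
from mpi4py import MPI

world = MPI.COMM_WORLD
B = 4                                  # number of band groups (must divide world.size)
G = world.size // B                    # number of real-space domains

band_group   = world.rank // G         # which slice of bands this task holds
domain_index = world.rank % G          # which real-space domain this task holds

# band_comm: tasks with the same domain but different band groups (size B).
band_comm = world.Split(color=domain_index, key=band_group)

# domain_comm: tasks with the same band group but different domains (size G).
domain_comm = world.Split(color=band_group, key=domain_index)
```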

The memory requirements for a spin-unpolarized Γ-point calculation are shown in Table I. Note that both the order of memory storage and the distribution vary considerably among the different matrices. Although at first glance it would seem natural to distribute all matrices across the entire B×G process grid, this turns out to be impractical because it would lead to an increase in communication among the MPI tasks. Thus the matrices in GPAW are either completely replicated, partially replicated, or fully distributed with respect to the B×G process grid.

Table I. Memory requirements for the RMM-DIIS algorithm. Order of memory storage for matrices as a function of extensive system parameters and distribution with respect to the B×G process grid.

Matrix                    Memory Storage   B-axis        G-axis
ψng                       O(NbNg)          distributed   distributed
p^a_i(r)                  O(Np)            replicated    distributed
n(r), v(r), v_coul(r)     O(Ng)            replicated    distributed
Hmn, Smn, Wmn             O(Nb^2)          distributed   replicated
neighbor list             O(Na)            replicated    replicated

While the PS wave functions are fully distributed along the B- and G-axes, the real-space PAW projectors are distributed along the G-axis and are partially replicated along the B-axis. The PAW projectors are evaluated in real space and only exist inside an atom-centered augmentation sphere. This has the advantage of requiring only O(Np) storage instead of the O(NpNg) storage that is needed in plane wave DFT codes using reciprocal-space projectors. There are also a number of matrices (e.g., n(r)) that require O(Ng) memory storage and are distributed along the G-axis and replicated along the B-axis.

The Hamiltonian Hmn and overlap Smn matrices in the subspace of computed bands require O(Nb^2) storage. These matrices are not frequently distributed in DFT codes since they are negligible in size for most problems. Clearly, as the system size Nb increases, Hmn and Smn will eventually need to be distributed across multiple nodes. Hmn and Smn are distributed only along the B-axis (a one-dimensional block layout) and are replicated across the G-axis, effectively slicing these matrices along the row or column index but not both. Although these matrices are two-dimensional, a one-dimensional distribution has not proven to be a memory bottleneck for problems up to Nb ~ 10^4. Lastly, there are O(Na) arrays that are completely replicated since they require modest amounts of memory even for Na ~ 10^3, which is near the practical limits of O(N^3) DFT methods.
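For a sense of scale, the following back-of-the-envelope numbers (ours, not from the paper) use the largest test case in Table III and assume real-valued, double-precision (8-byte) Γ-point wave functions:

```python
Nb = 5696                              # bands, largest test case in Table III
Ng = 128 * 256 * 256                   # grid points = 8_388_608
bytes_per_value = 8                    # real float64 values (Gamma point)

psi_total = Nb * Ng * bytes_per_value  # total psi_ng storage, spread over B x G tasks
hmn_copy = Nb * Nb * bytes_per_value   # one replicated copy of H_mn (or S_mn)

print(psi_total / 2**30, "GiB")        # ~356 GiB in total: must be distributed
print(hmn_copy / 2**20, "MiB")         # ~248 MiB per copy: 1D distribution suffices
```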

3.2. Floating Point Operations and Amdahl’s Law

The strong-scaling efficiency of a parallel code is limited by the least parallel component of the algorithm. We identify these Amdahl's Law bottlenecks by analyzing GPAW's RMM-DIIS algorithm, as well as by confirming these results with computational experiments. Table II is a breakdown of the computational complexity for the individual steps in the SCF cycle. Although DFT methods are often associated with having O(N^3) computational complexity, the full description of the complexity contains five types of terms (in decreasing order): (i) O(Nb^2 Ng), (ii) O(Nb^3), (iii) O(NbNg), (iv) O(NbNp), and (v) O(Ng).

Table II. Computational complexity and scaling of the RMM-DIIS algorithm in terms of extensive system parameters. The scaling column is with respect to the B×G process grid. The ScaLAPACK process grids are typically chosen to be O(B^2) in size.

Description of Operation                                   Computational Complexity   Computational Scaling
v_coul(r), v_xc(r), v(r)                                   O(Ng)                      O(Ng/G)
density mixing                                             O(Ng)                      O(Ng/G)
(1st term in Eq. 3) × ψng                                  O(NbNg)                    O(NbNg/(BG))
(2nd term in Eq. 3) × ψng                                  O(NbNg)                    O(NbNg/(BG))
(3rd term in Eq. 3) × ψng                                  O(NbNp)
constructing n(r)                                          O(NbNg)                    O(NbNg/(BG))
subspace diagonalization: εn, Wmn                          O(Nb^3)                    O(Nb^3/B^2)
subspace Cholesky decomposition including inversion: L^-1  O(Nb^3)                    O(Nb^3/B^2)
constructing Hmn, Smn                                      O(Nb^2 Ng)                 O(Nb^2 Ng/(BG))
rotation of wave functions                                 O(Nb^2 Ng)                 O(Nb^2 Ng/(BG))

A parallelization strategy employing only domain decomposition (parallelization along the G-axis) leads to Amdahl's Law bottlenecks arising from the O(Nb^3) terms. GPAW is able to calculate these terms either using LAPACK [36], in which case the computation is replicated across all MPI tasks, or ScaLAPACK [37], in which case these terms are computed in a parallel fashion using a small subset of MPI tasks since Nb ≪ Ng. Simultaneous parallelization on domains and bands (parallelization along the B- and G-axes) leads to an additional Amdahl's Law bottleneck arising from the O(Ng) terms, whose computation is replicated along the B-axis.

Although it is not an Amdahl's Law bottleneck, the evaluation of the PAW projector terms leads to computational load imbalance in calculations where atoms are not evenly distributed geometrically. The computational complexity of evaluating the PAW projectors in real space is O(NbNp), much lower than in reciprocal space, which is O(NbNpNg). This greatly reduces the total number of floating-point operations, especially for larger system sizes. However, the distribution of PAW projectors (and their associated computational work) is determined by the domain-decomposition parallelization scheme. The total real-space grid is partitioned into equal-sized subdomains that may or may not contain projectors.§ The effects are most readily observed for slab calculations with a vacuum, or any other configuration of atoms that leads to a non-uniform distribution of projectors, and thus computational load imbalance, both in terms of memory and in terms of floating-point operations. It is worth noting that this load imbalance could potentially be reduced by leveraging a computational runtime environment that can handle both global task scheduling (e.g., MADNESS [38]) and global shared-memory arrays (e.g., Global Arrays [39]), but this is currently an open issue for GPAW.

In spin-polarized cases and in periodic systems with k-points, Equation 1 can be solved independently for each spin and k-point. The computational complexity of most of the operations increases by a factor of NsNk, where Ns is the number of spins and Nk is the number of k-points. GPAW can parallelize over both the spin and k-point degrees of freedom, and because communication is needed only when constructing the charge density, the parallelization is very efficient. However, because the number of spins is limited to two, and the number of k-points is roughly inversely proportional to the number of atoms, spin and k-point parallelization has only limited use in very large-scale calculations.

3.3. Communication Patterns

The GPAW code has four different types of communication patterns:

1. Halo-exchange communication patterns arising from FD stencil operations on n, v_coul, and ψng, where boundary points have to be transferred between processes.
2. Projector–wave function integrals P^a_ni = Σ_{ga} p^a_{i ga} ψ_{n ga}, where the ga are grid points defined inside the augmentation sphere of atom a.
3. Parallel matrix multiplies for the construction of Hmn, Smn, and rotation operations on ψng.
4. Dense linear algebra operations on Nb×Nb matrices, which are handled by ScaLAPACK.

Because the augmentation spheres typically extend only to adjacent domains, their communication pattern is three-dimensional nearest-neighbor, similar to the halo exchange.

Because of the specialized nature of the parallel matrix multiplies in GPAW, we chose to write our own routines on top of MPI and BLAS libraries. The routines themselves are complicated due to the need to support slightly irregular distributions of ψng, as well as the PAW terms. The calculation of the matrix elements Hmn and Smn requires communication between all the MPI tasks within band_comm. However, by utilizing a pipeline with overlapping communication and computation, the communication can in practice be performed as a one-dimensional nearest-neighbor exchange. A more detailed description of our parallel matrix multiply algorithms is presented in Appendix B. Here we summarize the main features of our implementation.

1. Two distinct phases: a one-dimensional systolic ring algorithm implemented using MPI_Isend and MPI_Irecv on band_comm, immediately followed by MPI_Reduce on domain_comm (see the sketch after this list). The MPI communicators band_comm and domain_comm are shown in Figure 1.
2. Internal blocking inside the ring algorithm to reduce memory usage and MPI message size. The amount of memory used for blocking is specified with the buffer_size parameter in units of kibibytes (KiB).
3. Use of ScaLAPACK for O(Nb^3) operations. Our present implementation uses a single two-dimensional process grid for all ScaLAPACK operations in a given real-space DFT calculation, but a different two-dimensional process grid is allowed for the LCAO initialization.

§GPAW allows Ng mod G ≠ 0, which can lead to subdomains that are slightly different in volume; subdomains with significantly different volumes are not supported.



The algorithm for the rotation of wave functions by the unitary matrix W is similar, except that MPI_Reduce is not needed and W is partially replicated, as noted in Section 3.1. Our parallel matrix multiply algorithms are scalable in terms of communication and memory, but a disadvantage is that they do not make use of as many network links as would be possible if the MPI_Isend and MPI_Irecv calls were replaced by MPI collectives, particularly MPI_Bcast.

An important point in optimizing the communication pattern of GPAW is recognizing that the communication patterns are radically different in terms of the amount of communication per MPI task. For example, the halo-exchange communication is proportional to the surface area of the subdomain, (Ng/G)^{2/3}, while the parallel matrix multiplies are proportional to the volume of the subdomain, Ng/G. On the other hand, the projector–wave function integrals are independent of the domain size.
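A concrete example (our arithmetic) for the 4×4×8 case with {Gi} = (8, 8, 16) makes the disparity explicit: each task owns a 16×16×16 subdomain, so one halo layer involves roughly 1.5k grid points, while the ring algorithm circulates the full 4k points per band.

```python
ng = (128, 128, 256)                     # total grid for the 4x4x8 case (Table III)
gi = (8, 8, 16)                          # domain decomposition {G_i}
nx, ny, nz = (n // g for n, g in zip(ng, gi))   # 16 x 16 x 16 points per task

volume = nx * ny * nz                    # 4096 points: moved per band by the ring
surface = 2 * (nx*ny + ny*nz + nx*nz)    # 1536 points: one halo layer
                                         # (FD stencils exchange several such layers)
print(volume, surface)
```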

In principle, a four- (or higher-) dimensional torus network has a topology that can accommodate nearest-neighbor communication for all these distinct communication patterns simultaneously. For example, the Blue Gene/P node can operate in virtual node (VN) mode, with each of the four cores in a node running an MPI task, effectively adding a very short fourth dimension to the three-dimensional torus network topology. We show in the results section that GPAW's scalability is primarily determined by optimizing the mapping so that the communication in the parallel matrix multiplies is nearest-neighbor, since this is the worst communication pattern in the RMM-DIIS algorithm.

3.4. Initialization Phase, Single-core Peak Performance, and Check-pointing

The initialization phase is defined to occur at the beginning of the calculation and to occur only once per calculation. It includes the LCAO initialization, which is itself an entirely separate DFT algorithm that relies on direct diagonalization instead of iterative diagonalization. The GPAW LCAO code is described in greater detail in Larsen et al. [13]. Although the LCAO code has not been optimized as thoroughly as the real-space code, it contains sufficient parallelism to generate initial PS wave functions for a DFT calculation with at least Nb ~ 10^4.

Single-core peak performance is a measure of the efficiency with which software can use the floating point units on the computer cores. This efficiency is heavily dependent on the nature of the algorithmic kernel and on hardware details. Today's computer architectures use deep hierarchical levels of cache memory and have relatively slow access speeds to main memory. In the strong-scaling limit, the arrays used in a DFT calculation become smaller for a fixed problem size. Since smaller arrays lead to a higher ratio of load operations to floating-point operations, application developers tuning their computational kernels must work increasingly harder to retain reasonable percentages of the single-core peak performance.

For instance, vendor-supplied BLAS [40] libraries usually have the double-precision general matrix multiply (DGEMM) function tuned to obtain >75% of single-core peak performance on large square arrays. However, the arrays used in GPAW's DGEMM calls include long skinny as well as small square arrays. The results obtained here used the optimized DGEMM available in the single-threaded version of IBM's ESSL [41], which has been primarily optimized for large square matrices, where it obtains >80% of the single-core peak performance. In sharp contrast, DGEMM as used in GPAW typically obtains only 40%-60% of the single-core peak performance using ESSL. Furthermore, as a DFT calculation is strong-scaled to larger numbers of computer cores, the dimensions of these arrays become even smaller, which reduces the single-core peak performance and increases the total wall-clock time of a calculation.
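This shape sensitivity is easy to reproduce with a toy measurement (ours, not from the paper), timing matrix products for a large square case against the tall-and-skinny and small square shapes that dominate GPAW's DGEMM calls; NumPy dispatches these products to the platform BLAS, and the chosen dimensions are illustrative.

```python
import time
import numpy as np

def gflops(m, n, k, repeats=5):
    """Measured DGEMM rate for C(m, n) = A(m, k) @ B(k, n)."""
    a = np.random.rand(m, k)
    b = np.random.rand(k, n)
    t0 = time.perf_counter()
    for _ in range(repeats):
        a @ b
    dt = (time.perf_counter() - t0) / repeats
    return 2.0 * m * n * k / dt / 1e9

print("square      :", gflops(2000, 2000, 2000))   # shape BLAS is tuned for
print("tall-skinny :", gflops(256, 256, 65536))    # S_mn-like: few bands, many grid points
print("small square:", gflops(256, 256, 256))      # subspace-sized blocks
```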

4. RESULTS

4.1. Preliminaries: Test Cases, Timers, TAU, and Blue Gene/P architecture

The performance characteristics of DFT codes are perhaps unique in the computational science community, in that there is no single algorithmic kernel that dominates SCF algorithms. Their distinguishing traits can be understood by studying the weak- and strong-scaling efficiency as a function of the system size parameters (Na, Nb, Ng, Np). In this section, we obtain performance data on several test cases with the GPAW code to gain insight into:

• The computational kernels that dominate the wall-clock time.
• The computational kernels that limit scalability.

This information allows us to construct a phenomenological map of GPAW's performance characteristics. Such a phenomenological map is valuable not only to DFT users, but also to DFT developers who are seeking to port their codes to new architectures and obtain good performance.

We study the performance of GPAW by collecting timing data for DFT calculations of bulk (face-centered cubic) gold for a variety of repeated cubic unit cells (L1×L2×L3) listed in Table III. The cell sizes were chosen to facilitate comparison, while bulk gold was chosen for no other reason than to distinguish it from the proverbial test case in the condensed matter community (i.e., bulk silicon). As the length of the cell is doubled in a dimension, the total number of atoms, bands, and grid points is also doubled.

ScaLAPACK process grids were chosen to be square in shape and were not exhaustively tuned for performance. Our current implementation constrains all ScaLAPACK operations within a given calculation to a single process grid. Ideal weak-scaling with a fixed time per SCF cycle requires rescaling a two-dimensional process grid by a ratio proportional to Nb^3. However, rescaling a two-dimensional process grid by a ratio proportional to √(Nb^3) is not generally possible, so we chose to rescale the ScaLAPACK process grid by a ratio proportional to Nb^2 instead. While it has been shown that 2.5- and 3-dimensional parallel dense linear algebra algorithms [42, 43] are required to achieve ideal scaling, we still expect that the two-dimensional algorithms used by ScaLAPACK will lead only to a linear increase in the time-to-solution as Nb increases; we show later in this section that this is not the case.

Table III. Extensive system and ScaLAPACK process grid parameters for bulk gold test cases.

L1 L2 L3   Na     Nb     Ng                ScaLAPACK process grid
2  4  4    128    712    (64, 128, 128)    2×2
4  4  4    256    1424   (128, 128, 128)   4×4
4  4  8    512    2848   (128, 128, 256)   8×8
4  8  8    1024   5696   (128, 256, 256)   16×16

All calculations used a grid spacing of h = 0.125 Å and a gold PAW setup containing 18 PAW projector functions. Setups version 0.8.7929 was used, which can be obtained online.¶ Ground-state DFT calculations were performed in the local density approximation (LDA) [2] for 10 SCF iterations with Γ-point sampling of the Brillouin zone. The template for our input file is included in Appendix A.
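Appendix A is not reproduced here, but a script for this type of calculation would look roughly like the sketch below. The parameter names follow the present-day public GPAW/ASE interfaces, which may differ from the version benchmarked in this work, and the lattice constant, parallel layout, and eigensolver spelling are illustrative assumptions rather than the authors' input.

```python
from ase.build import bulk
from gpaw import GPAW

# 2x4x4 repetition of the cubic fcc-Au cell (smallest case in Table III);
# the lattice constant is illustrative, not taken from the paper.
atoms = bulk('Au', 'fcc', a=4.08, cubic=True).repeat((2, 4, 4))

calc = GPAW(mode='fd',                    # real-space finite-difference mode
            h=0.125,                      # grid spacing in Angstrom
            xc='LDA',
            kpts=(1, 1, 1),               # Gamma-point sampling
            eigensolver='rmm-diis',
            maxiter=10,                   # the benchmarks run a fixed 10 SCF iterations;
                                          # production runs rely on convergence criteria
            parallel={'domain': (4, 8, 8),   # {G_i}
                      'band': 4,             # B band groups
                      'sl_auto': True})      # let GPAW pick a ScaLAPACK grid
atoms.calc = calc
energy = atoms.get_potential_energy()
```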

The performance data presented here were collected with the Tuning and Analysis Utilities (TAU) Performance System® [44, 45, 46, 47] (henceforth referred to simply as TAU). It consists of a number of tools for instrumenting code as well as for storing and analyzing performance data. We have collected performance data by swapping out GPAW's default Python timers with those from TAU (shown as BGP TIMERS in our figures). The performance data shown are based on timers that are normally defined in a ground-state DFT calculation (excluding those from the LCAO initialization) and are shown in Table IV. A special font is used in the main text to identify a timer. The timers represent exclusive timings unless noted otherwise. A number of timers that take up a negligibly small fraction of the total time have been intentionally omitted to make the figures easier to read.

¶https://wiki.fysik.dtu.dk/gpaw/setups/setups.html

Table IV. Timers in the SCF cycle as obtained from the output of the GPAW code. Right (left) indentation indicates a child (parent) relationship in the hierarchy. Timers are non-overlapping at the same level. The sum of child timers does not necessarily provide complete coverage of the parent timer. Timers marked with an asterisk have been absorbed into their parent timer.

SCF-cycle
  Density
    Atomic density matrices*
    Mix*
    Multipole moments*
    Pseudo density*
    Symmetrize density*
  Hamiltonian
    Atomic*
      XC Correction
    Communicate energies
    Hartree integrate/restrict*
    Poisson
    XC 3D grid*
    vbar*
  Orthonormalize*
    Blacs Band Layouts*
      Inverse Cholesky
    calc_s_matrix
    projections
    rotate_psi
  RMM-DIIS
    Apply hamiltonian
    precondition
    projections
  Subspace diag*
    Blacs Band Layouts*
      Diagonalize
      Distribute results
    calc_h_matrix
      Apply hamiltonian
    rotate_psi

The Blue Gene/P architecture has three networks for node-to-node communication [18]: (i) a global collective network, (ii) a global barrier network, and (iii) a three-dimensional torus network. The global collective and global barrier networks handle MPI collectives and barrier calls, respectively, on the MPI_COMM_WORLD communicator, while the three-dimensional torus network handles all other types of MPI calls. The three-dimensional torus network is a low-latency, high-bandwidth network where each node is directly connected to its six nearest neighbors with bi-directional links. A unique feature of the Blue Gene/P architecture is that it supports partitions which are contiguous and allow different mappings of MPI tasks [48]. On a Blue Gene/P partition, a user's calculation is isolated and hence protected from external network traffic. A mapping refers to the assignment of MPI tasks on the three-dimensional torus network and turns out to be an important aspect of strong-scaling a GPAW calculation. For the work presented here, we use an MPI-only version of the GPAW code. Thus, all our results were obtained in VN mode, where one MPI task runs on each of the four cores of the Blue Gene/P node.



4.2. Domain Decomposition and Band Parallelization

We begin by showing two performance enhancements implemented in the RMM-DIIS algorithm: (i) application of the KS operator to multiple wave functions at a time, as specified by the blocksize parameter, and (ii) band parallelization (described in Section 3.2). Figure 2 shows the effect of these algorithmic strategies in isolation, as well as in tandem, for the smallest of our bulk gold test cases (2×4×4). The baseline case, corresponding to blocksize = 1 and B = 1, shows rapid performance degradation above 256 MPI tasks with {Gi} = (4, 8, 8). The strong-scaling efficiency of the baseline case is significantly improved even for the B = 1 case by simply increasing to blocksize = 10. This improvement comes from a reduction in communication latency due to the consolidation of small MPI messages into larger ones. The total number of sends/receives per MPI task is proportional to the total number of Hψng operations. Thus, applying the KS operator to multiple wave functions at a time, instead of one wave function at a time, reduces the time spent in communication.‖ Hence, the performance data collected in later sections were obtained with blocksize = 10.

The efficacy of a domain-decomposition-only approach cannot continue indefinitely, because it is well known that these types of algorithms are limited by a surface-to-volume effect [49]. In Figure 2, we observe the strong-scaling efficiency falling below 75% at 1024 MPI tasks with {Gi} = (8, 8, 16) for blocksize = 10 and B = 1. This behavior is in line with observations on other computer architectures, and we generally consider Ngi/Gi ~ 10 for i ∈ {1, 2, 3} to be the minimum efficient size for a real-space domain on an MPI task. The best strong-scaling performance is achieved by simultaneously parallelizing over bands and domains along with blocksize = 10, proving that our approach is tenable even for the smallest of our test cases.

[Figure 2 plot: strong-scaling efficiency (%) versus MPI tasks (0-1024), with curves for blocksize = 1, B = 1; blocksize = 1, B = 4; blocksize = 10, B = 1; and blocksize = 10, B = 4.]

Figure 2. Strong-scaling efficiency as obtained by the inclusive SCF-cycle timer for 10 SCF iterations of a ground-state DFT calculation on bulk gold 2×4×4 as a function of blocksize and B. Relevant DFT calculation parameters are listed in Table III.

We examine the blocksize = 10 and B = 1 case for bulk gold 2×4×4 in more detail to understand the underlying performance degradation. Figure 3a shows a breakdown of the inclusive time in SCF-cycle as a function of MPI tasks. This type of plot can be obtained from TAU's PerfExplorer and is called a fractional stacked-bar chart. The colors in the data bar of the stacked-bar chart are organized from left to right in the same order as the legend. Figure 3a shows that the strong-scaling bottlenecks at 1024 MPI tasks are Communicate energies, Distribute results, and precondition. We expect precondition to be among the computational kernels that are affected by a surface-to-volume effect, especially since these operations occur on a grid coarser than the wave function grid (2h instead of h).

‖In our current implementation, only the communication from the projector–wave function integral is aggregated. The communication from the FD stencil operations on the wave functions still occurs one band at a time.

The presence of Communicate energies and Distribute results as bottlenecks is at first glance perplexing, but it can be understood by looking at where these timers are located in the GPAW code. Distribute results and Communicate energies contain MPI collectives that are called after XC Correction and Diagonalize, respectively. Thus time spent in these timers is indicative of load imbalance in the preceding timers. While there are a number of ways to show this load imbalance and the relationship between these timers, it is most elegantly shown by a normal probability plot [50] obtained with PerfExplorer.

A normal probability plot is a graphical technique for assessing whether data are normally distributed: data that are normally distributed fall on the ideal normal line, while departures from a normal distribution show up as departures from this line. Figure 3b shows that XC Correction, Communicate energies, Diagonalize, and Distribute results contain MPI tasks with timings outside statistical normality; this is in sharp contrast to precondition, whose data points fall on the ideal normal line. The load imbalance in Diagonalize arises because the dense diagonalization is performed on a subset of cores which form the 2×2 process grid. The load imbalance in XC Correction arises from the exchange–correlation PAW corrections, which are evaluated around each atom. In our current implementation, this computational work is local to domains containing atoms; thus load imbalance is present due to idle processes when the number of MPI tasks (= B×G) is greater than Na. Note that this computation is also replicated across band groups. In subsequent figures, we absorb Communicate energies into XC Correction and Distribute results into Diagonalize for the sake of clarity and simplicity.
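A normal probability plot like Figure 3b can be generated directly from the per-task exclusive times; a small sketch using SciPy is shown below, where the input file name is a placeholder for timings exported from the TAU profiles.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder: one exclusive-time value per MPI task for a given timer,
# exported from the TAU profiles.
timings = np.loadtxt('xc_correction_times.txt')

# Ordered data versus theoretical normal quantiles; points that leave the
# fitted line indicate tasks whose timing is outside statistical normality.
stats.probplot(timings, dist='norm', plot=plt)
plt.xlabel('Normal quantiles')
plt.ylabel('Exclusive time (s)')
plt.savefig('probplot_xc_correction.png')
```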

4.3. Mapping and DCMF environment variables

In order to minimize the time-to-solution, we gathered timings for bulk gold 4×4×8 at 2048 nodes while varying the mapping, the buffer_size parameter, and the Deep Computing Messaging Framework (DCMF) [51] protocol variables. The standard mappings supported are permutations of XYZT, with T located either at the beginning or the end of the mapping specification string. T can be thought of as the dimension of parallelism available within the node and is equal to the number of MPI tasks per node (T = 4 for VN mode). While it is possible to specify a nonstandard mapping using a mapfile, it is most convenient for typical users to work with the standard mappings instead. By design, the MPI tasks in GPAW's world communicator are ordered so that the band index is incremented last. The ordering of MPI tasks in GPAW's world communicator is fixed and not re-ordered as a function of the partition dimension. We have found that this approach is sufficient for scaling calculations that use simultaneous parallelization on bands and domains, but it has shortcomings when three or more levels of parallelization are invoked. For calculations which have simultaneous parallelization on bands and domains in conjunction with spins, k-points, and/or images, mapping should be accomplished by using a mapfile.

The 2048-node partitions on the Argonne Leadership Computing Facility (ALCF) Blue Gene/P have dimensions XYZT = 8×8×32×4 and can execute up to 8192 MPI tasks. For our computational experiments, we chose simultaneous parallelization on domains and bands with {Gi} = (8, 8, 16) and B = 8, which uses all the cores on the 2048-node partition. Table V shows that for all cases the MPI eager protocol yields the shortest time; this is because our parallel matrix multiply algorithm is able to overlap some communication and computation to the extent that is allowable by the hardware. It was discovered, somewhat counter-intuitively, that there is performance degradation when increasing buffer_size from 2048 KiB to 4096 KiB. We were able to confirm for select cases that this was due to the default DCMF_RECFIFO size, which is 8192 KiB per node. Because there are four MPI tasks per node in VN mode, four times the parallel matrix multiply buffer_size must be a lower bound for the value of DCMF_RECFIFO; thus a buffer_size of 2048 KiB can already fill up the default DCMF_RECFIFO buffer. It would be possible to benefit from a larger buffer_size value, but only at the expense of a larger DCMF_RECFIFO buffer value, which would reduce the total amount of memory allocatable to GPAW.

Figure 3. Ten SCF iterations of a ground-state DFT calculation on bulk gold 2×4×4 with blocksize = 10 and B = 1: (a) wall-clock time breakdown; (b) normal probability plot at 1024 MPI tasks. A complete set of timers and relevant DFT calculation parameters are listed in Tables IV and III, respectively.

Table V. Inclusive timings for the SCF cycle (10 SCF iterations excluding LCAO initialization) of the bulk gold 4×4×8 case at 2048 nodes as a function of mapping, parallel matrix multiply buffer_size, and DCMF protocol. The ScaLAPACK process grid was incidentally set to 6×6 instead of the value noted in Table III. The three DCMF protocols are eager, rendezvous (rzv), and optimized rendezvous (optrzv) [48]. Data were collected in virtual node mode (four MPI tasks per node) with DCMF_EAGER set to 8192 KiB. Mappings where the last index is equal to B = 8 are indicated with an asterisk (*).

Mapping   Permuted Partition   buffer_size   Timing (seconds)
          Dimensions           (KiB)         eager   rzv   optrzv
XYZT      8×8×32×4             2048          592     619   610
                               4096          612     628   629
TXYZ      4×8×8×32             2048          471     532   520
                               4096          500     532   527
TZYX*     4×32×8×8             2048          462     479   478
                               4096          459     532   628

Lastly, we see that the largest performance enhancement comes from an optimal mapping of MPI tasks onto the torus network. This optimal mapping can be achieved when B is identical to the last index of the Blue Gene/P mapping. While in principle an optimal mapping could always be obtained on a four-dimensional network, architectural constraints prevent the use of certain values of {Gi} and B. The most obvious constraints are the total number of available cores per node (T dimension) and the available set of partition shapes and sizes (X, Y, and Z dimensions) on the Blue Gene/P at the ALCF. We illustrate this concept using a 512-node partition for simplicity in Figure 4. This figure depicts a 512-node partition with Cartesian dimensions 8×8×8. Figure 4a shows a decomposition of B = 4 along one of the partition dimensions. As discussed in Section 3.3, our parallel matrix multiply involves a large volume of communication in a one-dimensional systolic ring pattern. Large messages are sent using MPI_Isend and MPI_Irecv between each MPI process and its adjacent MPI process in band_comm. This communicator is not explicitly shown, but is defined to be the set of MPI processes forming columns perpendicular to the colored planes. In Figure 4a, band_comm consists of the perpendicular column of MPI processes from every other (i.e., second-nearest-neighbor) colored plane of nodes. On the other hand, Figure 4b shows a decomposition of B = 8, where band_comm consists of the perpendicular column of MPI processes from adjacent (i.e., first-nearest-neighbor) colored planes of nodes. The one-dimensional systolic communication pattern in the B = 4 case leads to hopping over a plane of nodes and thus significant network contention. The B = 8 decomposition requires no hopping over a plane of nodes and leads to much lower network contention.

Figure 4. Mappings available on a 512-node partition of Blue Gene/P with Cartesian dimensions equal to 8×8×8: (a) bad mapping for B = 4; (b) good mapping for B = 8. Cubes represent computer nodes. Nodes with the same color form planes of MPI processes that exist on the same domain_comm; band_comm is orthogonal to these planes and is not shown.

4.4. Performance data for bulk gold

We collected strong-scaling performance data for our four bulk gold benchmark test cases. The parameters for each of the cases are given in Table III. For each case, we created a stacked-bar and a fractional stacked-bar chart using TAU's PerfExplorer. The stacked-bar chart shows both the strong-scaling performance and the wall-clock time breakdown, while the fractional stacked-bar chart shows only the percentage wall-clock time breakdown. The fractional stacked-bar chart can be considered a magnified view of the stacked-bar chart that more clearly depicts fine-grained details. In the stacked-bar chart representation, ideal strong-scaling is shown by bars which are reduced proportionately between MPI tasks. For example, between 64 and 128 MPI tasks, the bars for each computational kernel would go down by a factor of 2 in the limit of ideal scalability. In the fractional stacked-bar chart representation, ideal strong-scaling performance is shown by a constant fraction of the wall-clock time with varying MPI tasks. These two representations of the performance data are complementary and helpful in interpreting our results. Unless noted otherwise, our results were obtained with blocksize set to 10, buffer_size set to 2048 KiB, DCMF_EAGER set to 8192 KiB, and B set to the last index of the Blue Gene/P mapping (whenever this was possible with the given partition).

Our smallest test case is 2×4×4, which was discussed in greater detail in Section 4.2; we only summarize the results here. Figure 5b shows that 74% of the inclusive SCF-cycle time is spent on Hψng products (the sum of Apply hamiltonian, projections, precondition, and RMM-DIIS), which are either O(NbNg) or O(NbNp) operations, while only about 24% is spent in parallel matrix multiplications (the sum of calc_s_matrix, calc_h_matrix, and rotate_psi), which are O(Nb^2 Ng) operations.∗∗ As the 2×4×4 case is strong-scaled to a larger number of MPI tasks, the emerging bottlenecks are XC Correction, arising from load imbalance, Diagonalize, arising from Amdahl's Law, and precondition, arising from a surface-to-volume effect.

The dominant computational kernels in the 4×4×4 case are very similar to those in the 2×4×4 case. However, it is worth noting that the strong-scaling bottlenecks are different for 4×4×4 than for 2×4×4. Figure 6b shows that XC Correction and Diagonalize are still an issue, but that precondition is no longer as relevant, even though Ngi/Gi ~ 10 for i ∈ {1, 2, 3} per MPI task. In addition, there are a number of subtle bottlenecks that appear at 4096 MPI tasks (B = 16). For instance, Poisson and Inverse Cholesky no longer take a negligible fraction of the inclusive SCF-cycle time, which is consistent with Amdahl's Law. We identified these inherent limitations of our parallelization approach in Section 3.2 (see Table II); namely, the Poisson computation is replicated for each group of bands, and Inverse Cholesky is constrained to the concurrency available in the ScaLAPACK process grid, which is only 4×4 = 16 MPI tasks for this test case.

The 4×4×8 case is the second largest test case we consider and is significantly different from the 2×4×4 case with respect to the dominant computational kernels. Figure 7b shows that 50% of the inclusive SCF-cycle time is spent in parallel matrix multiplies, while 46% is spent in Hψng products. This test case is interesting because it marks a transition region where the dominant computational kernels shift from the Hψng products, which are O(N^2) in computational complexity, to the parallel matrix multiplies, which are O(Nb^2 Ng). As the 4×4×8 case is strong-scaled out, the largest bottlenecks are XC Correction and Diagonalize, but Inverse Cholesky is also more prominent.

∗∗These percentages can be approximately obtained from visual inspection of the fractional stacked-bar chart.



Figure 5. Strong-scaling performance data as obtained by timers for 10 SCF iterations of a ground-state DFT calculation on bulk gold 2×4×4: (a) stacked-bar chart; (b) fractional stacked-bar chart. A complete set of relevant DFT calculation parameters and timers are listed in Tables III and IV, respectively.



Figure 6. Strong-scaling performance data as obtained by timers for 10 SCF iterations of a ground-state DFT calculation on bulk gold 4×4×4: (a) stacked-bar chart; (b) fractional stacked-bar chart. A complete set of relevant DFT calculation parameters and timers are listed in Tables III and IV, respectively.



Figure 7. Strong-scaling performance data as obtained by timers for 10 SCF iterations of a ground-state DFT calculation on bulk gold 4×4×8: (a) stacked-bar chart; (b) fractional stacked-bar chart. A complete set of relevant DFT calculation parameters and timers are listed in Tables III and IV, respectively.



Our largest test case investigated in this study is the 4×8×8, shown in Figure 8. We observe that 58% of the inclusive SCF-cycle time is spent in parallel matrix multiplies, while only 32% is spent in Hψng products. As the 4×8×8 test case is strong-scaled out, we see in Figure 8b that Diagonalize is the dominant computational kernel at 131,072 MPI tasks. In addition, at this very large scale, we discover that our parallel matrix multiplies are not scaling as well as in the previously discussed smaller test case (4×4×8). This is most easily seen in Figure 8a by comparing the exclusive time of calc_h_matrix, calc_s_matrix, and rotate_psi between 65,536 and 131,072 MPI tasks. We speculate that the poor scaling comes from the mapping not being commensurate with the value of B = 64 at 131,072 MPI tasks. This hypothesis is based on our previous observations of timings for the 4×4×8 test case, which exhibited a 28% performance penalty (compare the timings for buffer_size = 2048 with the eager protocol between XYZT and TZYX in Table V). Currently, it is not possible to accommodate an ideal mapping for B > 32 with the available partitions on the Blue Gene/P at the ALCF.

It is evident from Figures 7 and 8 that Diagonalize is the predominant factor in the poor performance of the two largest test cases. The subspace diagonalization in GPAW is performed by ScaLAPACK's divide-and-conquer dense diagonalization routine (PDSYEVD) [52]. Its weak-scaling performance is shown in Figure 9, where we plot the ideal versus the observed wall-clock time. This behavior is consistent with the findings of Sunderland [53] and Petschow [54]. While we have not investigated the source of the poor performance of PDSYEVD on Blue Gene/P, we note that Petschow identifies the reduction to tridiagonal form as a severe bottleneck when the upper triangle of the input matrix is referenced, which is the case in GPAW's current implementation. However, the performance issue noted by Petschow et al. becomes prevalent only at matrix sizes larger than 10,000, which is almost a factor of two larger than what is studied in this paper.


Figure 8. Strong-scaling performance data as obtained by timers for 10 SCF iterations of a ground-state DFT calculation on bulk gold 4×8×8: (a) stacked-bar chart; (b) fractional stacked-bar chart. A complete set of relevant DFT calculation parameters and timers are listed in Tables III and IV, respectively.


Figure 9. Weak-scaling performance of PDSYEVD for the bulk gold test cases as a function of the ScaLAPACK processor grid (x-axis: 0–256; y-axis: Diagonalize time in normalized units). The value of N_b (712, 1424, 2848, 5696) is annotated adjacent to the data points (blue crosses). The data points are the Diagonalize time normalized to the value at N_b = 712. Worse performance is indicated by data points (blue crosses) above the ideal line (black).

5. CONCLUSIONS

The scientific need for DFT calculations on increasingly larger system sizes, together with the availability of massively parallel supercomputers, has driven the development of more scalable algorithms. The GPAW data structures are completely scalable with respect to memory, with the possible exception of O(N_a) arrays. Our real-space implementation of the PAW method is parallelized over all quantum mechanical indices: k-points, spins, bands, and domains. We have examined the weak- and strong-scaling behavior of several bulk gold test cases ranging in size from N_b = 712 to N_b = 5696.

We determined that proper mapping of MPI tasks onto the Blue Gene/P torus is an important aspect of obtaining good performance. For N_b ∼ 500–2000, the dominant part of the inclusive SCF-cycle time was the Hψ_ng products, which contain all the O(N^2) terms. For N_b > 2000, the dominant part of the SCF cycle is the parallel matrix multiplies, with computational complexity O(N_b^2 N_g). In all test cases investigated, Diagonalize was an Amdahl's Law bottleneck, consistent with the findings of Kent [27]. As N_b increases, terms with O(N_b^3) computational complexity become an increasingly severe bottleneck for weak and strong scaling. A less severe, but non-negligible, strong-scaling bottleneck was XC correction, which suffers from load imbalance as well as from computation replicated across the different band groups (B-axis). For the largest calculation presented in this work, our parallel matrix multiply algorithm exhibited poor performance at 131,072 MPI tasks, caused by the inability to obtain an optimal mapping on the Blue Gene/P partition. A possible way to improve upon our current parallel matrix multiply algorithm is to replace the MPI_Isend/MPI_Irecv calls with MPI_Bcast, which would make better use of the Blue Gene/P three-dimensional torus network.


We have shown that ground-state DFT calculations up to N_a ∼ 1000 with N_b > 5000 are feasible on massively parallel supercomputers such as Blue Gene/P. Calculations of this scale have been possible since at least 2006, when Gygi et al. [55] performed similarly large calculations on Blue Gene/L, work that resulted in a Gordon Bell Award. The major differences are the use of uniform real-space grids instead of a plane-wave basis and the use of PAW instead of norm-conserving pseudopotentials. Kleis et al. [56] performed total energy calculations on gold nanoparticles containing as many as 1415 atoms (N_b > 8064) using GPAW on ALCF's Blue Gene/P. More recently, Li et al. [57] performed similarly large-scale DFT calculations on platinum nanoparticles. While the work presented here focuses on the Blue Gene/P architecture, other architectures have benefited from the same parallelization strategy; most notably the Cray XT5 (see Figure 14 in Reference [5]), and even commodity Linux clusters. The size of the largest test case investigated in this work is near the practical limit accessible by O(N^3) DFT methods. While it is possible to push this limit further with diagonalization-free approaches in conjunction with more scalable parallel dense linear algebra algorithms, robust O(N) DFT methods are needed to make calculations at N_b ∼ 10^4 practical for time-sensitive applications such as ab initio molecular dynamics.

When going beyond ground-state DFT calculations, new possibilities for utilizing massively parallel supercomputers arise. For example, time-dependent DFT (in both its linear-response and real-time propagation forms), the GW approximation, and the Bethe-Salpeter equation in GPAW have been parallelized over additional degrees of freedom with minimal communication requirements. These methods are also computationally much more demanding and are thus currently limited to much smaller system sizes than ground-state studies. The optimization work reported here also paves the way for applying these beyond-DFT schemes at a much larger parallel scale than previously possible.

ACKNOWLEDGMENTS

We thank Marcin Dułak from the Center for Atomic-scale Materials Design (CAMd) for the initial porting of GPAW to the Blue Gene/P at the Argonne Leadership Computing Facility (ALCF). We thank Nils Smeds (IBM Sweden) and William Scullin (ALCF) for useful discussions on Python-related issues on the Blue Gene/P architecture. We thank John Linford from ParaTools, Inc. and Kevin Huck from University of Oregon for assistance in generating the figures using TAU's PerfExplorer. We acknowledge Kalyan Kumaran (ALCF) and Jeff Greeley (Purdue University) for their support of this project. We are grateful to Carolyn M. Steele from Argonne National Laboratory's (ANL's) Communications, Education and Public Affairs (CEPA) division for reviewing and editing this manuscript. This work has been supported by the Academy of Finland (Project 110013 and the Center of Excellence program) and the Tekes MASI program. We acknowledge support from the Danish Center for Scientific Computing (DCSC). CAMd is sponsored by the Lundbeck Foundation. The research at the University of Oregon was supported by grants DOE ER26057, ER26167, ER26098, and ER26005 from the U.S. Department of Energy, Office of Science. This research used resources of the ALCF at ANL, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. A.H.L. acknowledges support from the European Research Council Advanced Grant DYNamo (ERC-2010-AdG Proposal No. 267374) and Grupo Consolidado UPV/EHU del Gobierno Vasco (IT578-13).

A. BENCHMARK INPUT FILE TEMPLATE

# Perform a single point total energy calculation
from ase import Atoms
from gpaw import GPAW, Mixer, ConvergenceError, PoissonSolver
from gpaw.poisson import PoissonSolver
from gpaw.utilities.timing import Timer, TAUTimer
from gpaw.eigensolvers.rmm_diis import RMM_DIIS


ps = PoissonSolver(nn='M', relax='GS', eps=1e-9)
es = RMM_DIIS(keep_htpsit=False, blocksize=10)

# dimensions
L1 = <value1>
L2 = <value2>
L3 = <value3>

# system
nbandspercell = 22.25
a = 4.08
bulk = Atoms('Au4',
             positions=((0, 0, 0),
                        (0.5, 0.5, 0),
                        (0.5, 0, 0.5),
                        (0, 0.5, 0.5)),
             pbc=True)
bulk.set_cell((a, a, a), scale_atoms=True)
bulk = bulk.repeat((L1, L2, L3))

calc = GPAW(h=0.1275,
            maxiter=10,
            mode='fd',
            poissonsolver=ps,
            nbands=int(nbandspercell * L1 * L2 * L3),
            spinpol=False,
            xc='LDA',
            width=0.01,
            mixer=Mixer(0.10, 5, 100.0),
            eigensolver=es,
            parallel={'buffer_size': 2048},
            txt='Au_bulk.out')

bulk.set_calculator(calc)

try:
    bulk.get_potential_energy()
except ConvergenceError:
    pass
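As a quick arithmetic check (not part of the benchmark template itself), the nbands expression above reproduces the N_b values quoted in the text for the four bulk gold test cases:

# Number of bands implied by the template for the four bulk gold test cases,
# using nbandspercell = 22.25 bands per 4-atom fcc cell.
nbandspercell = 22.25
for L1, L2, L3 in [(2, 4, 4), (4, 4, 4), (4, 4, 8), (4, 8, 8)]:
    print(f'{L1}x{L2}x{L3}: Nb = {int(nbandspercell * L1 * L2 * L3)}')
# -> 712, 1424, 2848 and 5696, matching the N_b values studied in this work.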

B. PARALLEL MATRIX MULTIPLY

There are two different types of parallel matrix multiplication routines needed by the RMM-DIIS algorithm. Although both types have computational complexity O(N_b^2 N_g), different routines are needed because of the distinct shapes and logical distributions of the input matrices. One routine is needed for the construction of H_mn and S_mn, and another for performing rotation operations on ψ_ng. The former has the general form O_mn = ⟨ψ_m|O|ψ_n⟩, where O is a matrix-free operator. Let φ_ng denote the matrices resulting from O|ψ_n⟩. We denote the row and column coordinates of the MPI task in the two-dimensional process grid by (r, c). Additionally, let B_r and G_c denote the range of indices of ψ_ng associated with MPI task (r, c).


We calculate O_mn in terms of an auxiliary matrix denoted O^{1D}[q](r, c), where each entry q is an N_b/B × N_b/B-sized subblock of O_mn. The superscript reminds us that this matrix contains one-dimensional slices of the O_mn matrix. Since O_mn is symmetric, only q = B div 2 exchanges are needed to build up all the subblocks required to form the lower triangle of O_mn.†† The capability to compute only half the matrix in parallel yields substantial computational savings and is not currently available in any parallel dense linear algebra package.

†† In general, for odd values of B greater than unity, exactly the lower triangle of O_mn is computed. For even values of B greater than two, a few subblocks in addition to the lower triangle are computed. In the limit of large even B, this additional computation is negligible.
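As a small, self-contained check of this counting argument (illustrative Python, not part of GPAW), the following enumerates which (m-block, n-block) pairs are computed when every rank stops after q = B div 2 exchanges, assuming, as in Eq. (7) below, that rank r holds the block of band group (r + q) mod B at exchange q:

# Check that q = 0 .. B div 2 exchanges cover the lower triangle of the block matrix.
# At exchange q, rank r computes the block pair (m, n) = ((r + q) mod B, r).
B = 8                                   # number of band groups (example value)
covered = {frozenset({(r + q) % B, r}) for r in range(B) for q in range(B // 2 + 1)}
needed = {frozenset({m, n}) for m in range(B) for n in range(m + 1)}
print(needed <= covered)                # True: every lower-triangle subblock is computed
print(B * (B // 2 + 1) - len(needed))   # 4: a few blocks are computed twice for even B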

Our parallel matrix multiply algorithm proceeds as follows:

1. Initially, without any communication, each MPI task can calculate its diagonal block contribution to O_mn, which is given by

   O^{1D}[0](r, c) = Σ_{g ∈ G_c} φ*_{mg} ψ_{mg} ∆V,  {m ∈ B_r},   (6)

   where ∆V is the volume element of the real-space grid. Then each MPI task in band_comm exchanges φ_ng with its nearest-neighbor rank and computes the next contribution. In general, for the q-th wave-function exchange, we have

   O^{1D}[q](r, c) = Σ_{g ∈ G_c} φ*_{mg} ψ_{ng} ∆V,  {m ∈ B_{(r+q) mod B}, n ∈ B_r}.   (7)

   This phase of the communication uses MPI_Isend and MPI_Irecv, which allows us to overlap the computation of O[q](r, c) with the communication of O[q+1](r, c) for q < B div 2. Instead of exchanging the entirety of φ_ng in a single pass, a buffer_size parameter is available to control the chunk of ψ_ng that is sent and received (not explicitly shown in the equations). While this additional layer of complexity is tedious to code, the memory savings are beneficial on low-memory nodes like those found on Blue Gene/P.

2. Summing along the columns of the two-dimensional process grid (MPI_Reduce on domain_comm), we have

   O^{1D}[q](r, c = 0) = Σ_c O^{1D}[q](r, c).   (8)

3. At this step, we have all the necessary blocks of O_mn, but they are not distributed in a form that is amenable to use with ScaLAPACK. This requires two additional steps; a minimal sketch of the exchange-and-reduce phase (steps 1 and 2) is given after this list.

   (a) Since the necessary subblocks are not on the needed task of the B×G process grid, an MPI-based routine assembles the blocks of O^{1D}[q](r, c = 0) into a one-dimensional column-wise block layout: O^{1D}_{mn}(r, c = 0) := O_{mn}(r, c = 0), {∀m, n ∈ B_r}.
   (b) A ScaLAPACK routine call redistributes O^{1D}_{mn}(r, c) to the two-dimensional block-cyclic layout required by the ScaLAPACK library. The dimensions of the ScaLAPACK process grid are a user-specified parameter and are independent of the B×G process grid. After the ScaLAPACK-based operations are completed, a second call to a ScaLAPACK routine returns the resulting matrix to its previous one-dimensional column-wise block layout.
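The following is a minimal mpi4py/NumPy sketch of the exchange-and-reduce pattern in steps 1 and 2 above. It is illustrative only and differs from GPAW's implementation in several assumed simplifications: it uses mpi4py rather than GPAW's own MPI interface, a single one-dimensional band communicator with no domain decomposition, blocking sendrecv instead of the overlapped MPI_Isend/MPI_Irecv, all B exchanges instead of the symmetric B div 2, and a trivial stand-in operator; the final Allreduce plays the combined role of the column reduction (step 2) and the assembly of the column blocks (step 3a).

# Illustrative mpi4py/NumPy sketch (not GPAW's code) of the ring-exchange
# construction of O_mn = sum_g phi*_mg psi_ng dV.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD              # plays the role of band_comm
B, r = comm.size, comm.rank
nb_local, ng, dV = 4, 128, 0.1     # bands per rank, grid points, volume element

rng = np.random.default_rng(seed=r)
psi = rng.standard_normal((nb_local, ng))      # psi_ng block for bands B_r
phi = 2.0 * psi                                # stand-in for O|psi_n>, here O = 2*I

O = np.zeros((B * nb_local, B * nb_local))     # full O_mn, assembled below
recv, owner = phi.copy(), r                    # phi block currently held, and its owner
for q in range(B):                             # GPAW stops at q = B div 2 (symmetry)
    rows = slice(owner * nb_local, (owner + 1) * nb_local)
    cols = slice(r * nb_local, (r + 1) * nb_local)
    O[rows, cols] = recv @ psi.T * dV          # Eqs. (6)/(7): sum over the grid index g
    recv = comm.sendrecv(recv, dest=(r - 1) % B, source=(r + 1) % B)
    owner = (owner + 1) % B

# Combine the per-rank column blocks; with domain decomposition this step would
# also perform the MPI_Reduce over domain_comm of step 2.
comm.Allreduce(MPI.IN_PLACE, O, op=MPI.SUM)
if r == 0:
    print('O_mn is symmetric:', np.allclose(O, O.T))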

The second type of parallel matrix multiply has the general form

   ψ'_{mg} = Σ_n U_{mn} ψ_{ng},

where U_mn is a unitary matrix. Our parallel matrix multiply algorithm is similar to the work of Solomonik [42] and proceeds as follows (a minimal sketch of the rotation is given after the list):

1. U_mn is computed by ScaLAPACK in its native two-dimensional block-cyclic layout as the result of subspace diagonalization or of Cholesky factorization plus triangular inversion.


2. Using a ScaLAPACK routine, U_mn is redistributed from the two-dimensional block-cyclic layout into a one-dimensional row-wise block layout that exists only on the root node of domain_comm: U^{1D}_{mn}(r, c = 0) := U_{mn}(r, c = 0), {m ∈ B_r, ∀n}.
3. U^{1D}_{mn}(r, c = 0) is broadcast along the columns of the two-dimensional process grid (MPI_Bcast on domain_comm). We define a new array U^{1D}[q](r) in a manner reminiscent of our previously described matrix multiply. The c index is dropped because the matrix is replicated on the G-axis, and the q index is introduced because this matrix is accessed in N_b/B × N_b/B subblocks.

4. Each MPI task calculates its local contribution to the rotated wave functions:

   ψ'_{mg}(r, c) = Σ_{n ∈ B_r} U^{1D}[0](r) ψ_{ng}(r, c),  {m ∈ B_r, g ∈ G_c}.   (9)

   Since symmetry is not as easily exploitable here, q = B − 1 exchanges are required to rotate the wave functions. In general, the q-th exchange is given by

   ψ'_{mg}(r, c) = ψ'_{mg}(r, c) + Σ_{n ∈ B_r} U^{1D}[q](r) ψ_{ng}(r, c),  {m ∈ B_r, g ∈ G_c},   (10)

   where the ψ'_{mg} on the right-hand side contains the cumulative contributions from the previous q − 1 exchanges. The communication pattern here closely follows the prior algorithm with respect to overlapping computation and communication with MPI_Isend and MPI_Irecv. We also use the buffer_size parameter to control the chunk of ψ_ng that is sent and received.
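A corresponding sketch of the rotation follows, under the same assumed simplifications as before (standalone mpi4py/NumPy rather than GPAW's MPI interface, a 1D band communicator, no domain decomposition, blocking communication, and a unitary matrix built redundantly on every rank instead of being produced and broadcast by ScaLAPACK):

# Illustrative mpi4py/NumPy sketch (not GPAW's code) of the rotation
# psi'_mg = sum_n U_mn psi_ng, with the psi blocks circulated around the ring
# and each rank holding only its row block of U (the U^1D layout above).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
B, r = comm.size, comm.rank
nb_local, ng = 4, 128
Nb = B * nb_local

rng = np.random.default_rng(seed=0)                  # same seed on every rank
U = np.linalg.qr(rng.standard_normal((Nb, Nb)))[0]   # a global orthogonal matrix
U_rows = U[r * nb_local:(r + 1) * nb_local, :]       # rows m in B_r (the U^1D block)

psi = np.random.default_rng(seed=r + 1).standard_normal((nb_local, ng))

psi_new = np.zeros_like(psi)                         # accumulates psi'_mg for m in B_r
recv, owner = psi.copy(), r
for q in range(B):                                   # B - 1 exchanges are needed in total
    cols = slice(owner * nb_local, (owner + 1) * nb_local)
    psi_new += U_rows[:, cols] @ recv                # Eqs. (9)/(10): accumulate over n
    if q < B - 1:
        recv = comm.sendrecv(recv, dest=(r - 1) % B, source=(r + 1) % B)
        owner = (owner + 1) % B

# Cross-check against a replicated serial product.
psi_all = np.concatenate(comm.allgather(psi), axis=0)
assert np.allclose(psi_new, (U @ psi_all)[r * nb_local:(r + 1) * nb_local, :])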

We schematically depict the distinct MPI operations on the rows and columns of the process grid in Table VI below:

Table VI. MPI operations in the parallel matrix multiply routines

ISend/IRecv (band_comm)
    Before:  rank 0: ψ_{B0,Gc}    rank 1: ψ_{B1,Gc}    rank 2: ψ_{B2,Gc}
    After:   rank 0: ψ_{B2,Gc}    rank 1: ψ_{B0,Gc}    rank 2: ψ_{B1,Gc}

Reduce (domain_comm)
    Before:  rank 0: O[q](r, 0)   rank 1: O[q](r, 1)   rank 2: O[q](r, 2)
    After:   rank 0: Σ_c O[q](r, c)

Bcast (domain_comm)
    Before:  rank 0: U^{1D}_{mn}(r, 0)
    After:   rank 0: U^{1D}_{mn}(r, 0)   rank 1: U^{1D}_{mn}(r, 0)   rank 2: U^{1D}_{mn}(r, 0)

REFERENCES

1. Hohenberg P, Kohn W. Inhomogeneous electron gas. Phys. Rev. 1964; 136:B864–B871.
2. Kohn W, Sham LJ. Self-consistent equations including exchange and correlation effects. Phys. Rev. 1965; 140:A1133–A1138.
3. Amdahl G. Validity of the single processor approach to achieving large-scale computing capabilities. American Federation of Information Processing Societies Conference Proceedings 1967; 30:225–231.
4. Mortensen JJ, Hansen LB, Jacobsen KW. Real-space grid implementation of the projector augmented wave method. Phys. Rev. B Jan 2005; 71:035109, doi:10.1103/PhysRevB.71.035109.
5. Enkovaara J, Rostgaard C, Mortensen JJ, Chen J, Dułak M, Ferrighi L, Gavnholt J, Glinsvad C, Haikola V, Hansen HA, et al. Electronic structure calculations with GPAW: a real-space implementation of the projector augmented-wave method. J. Phys.: Condens. Matter 2010; 22:253202.


6. Chelikowsky JR, Troullier N, Saad Y. Finite-difference-pseudopotential method: Electronic structure calculations without a basis. Phys. Rev. Lett. 1994; 72:1240–1243.
7. Briggs EL, Sullivan DJ, Bernholc J. Large-scale electronic-structure calculations with multigrid acceleration. Phys. Rev. B 1995; 52:R5471–R5474.
8. Blöchl PE. Projector augmented-wave method. Phys. Rev. B 1994; 50:17953–17979.
9. Walter M, Häkkinen H, Lehtovaara L, Puska M, Enkovaara J, Rostgaard C, Mortensen JJ. Time-dependent density-functional theory in the projector augmented-wave method. J. Chem. Phys. 2008; 128(24):244101, doi:10.1063/1.2943138. URL http://link.aip.org/link/?JCP/128/244101/1.
10. Yan J, Mortensen JJ, Jacobsen KW, Thygesen KS. Linear density response function in the projector augmented wave method: Applications to solids, surfaces, and interfaces. Phys. Rev. B Jun 2011; 83:245122, doi:10.1103/PhysRevB.83.245122. URL http://link.aps.org/doi/10.1103/PhysRevB.83.245122.
11. Rostgaard C, Jacobsen KW, Thygesen KS. Fully self-consistent GW calculations for molecules. Phys. Rev. B Feb 2010; 81:085103, doi:10.1103/PhysRevB.81.085103. URL http://link.aps.org/doi/10.1103/PhysRevB.81.085103.
12. Yan J, Jacobsen KW, Thygesen KS. Optical properties of bulk semiconductors and graphene/boron nitride: The Bethe-Salpeter equation with derivative discontinuity-corrected density functional energies. Phys. Rev. B Jul 2012; 86:045208, doi:10.1103/PhysRevB.86.045208. URL http://link.aps.org/doi/10.1103/PhysRevB.86.045208.
13. Larsen AH, Vanin M, Mortensen JJ, Thygesen KS, Jacobsen KW. Localized atomic basis set in the projector augmented wave method. Phys. Rev. B 2009; 80:195112.
14. Larsen AH. Efficient electronic structure methods applied to metal nanoparticles. PhD Thesis, Technical University of Denmark, November 2011.
15. Enkovaara J, Romero NA, Shende S, Mortensen JJ. GPAW - massively parallel electronic structure calculations with Python-based software. Procedia Computer Science 2011; 4:17–25.
16. Gropp W, Lusk E, Skjellum A. Using MPI, 2nd Edition: Portable Programming with the Message Passing Interface. The MIT Press, 1999.
17. Gropp W, Lusk E, Skjellum A. Using MPI-2: Advanced Features of the Message-Passing Interface. The MIT Press, 1999.
18. IBM Blue Gene team. Overview of the IBM Blue Gene/P project. IBM Journal of Research and Development 2008; 52:199–220.
19. Baumeister P. Real-Space Finite-Difference PAW Method for Large-Scale Applications on Massively Parallel Computers, Reihe Schlüsseltechnologien / Key Technologies, vol. 53. Forschungszentrum Jülich GmbH Zentralbibliothek, Verlag: Jülich, 2012. URL http://juser.fz-juelich.de/record/128376. Schriftenreihen des Forschungszentrum Jülich; Zugl.: Aachen, RWTH, Diss., 2012.
20. Bottin F, Leroux S, Knyazev A, Zerah G. Large-scale ab initio calculations based on three levels of parallelization. Computational Materials Science 2008; 42(2):329–336, doi:10.1016/j.commatsci.2007.07.019.
21. Hutter J, Curioni A. Dual-level parallelism for ab initio molecular dynamics: Reaching teraflop performance with the CPMD code. Parallel Computing 2005; 31:1.
22. Hutter J, Curioni A. Car-Parrinello molecular dynamics on massively parallel computers. ChemPhysChem 2005; 6:1788.
23. Gygi F. Architecture of Qbox: A scalable first-principles molecular dynamics code. IBM Journal of Research and Development January/March 2008; 52(1/2):1.
24. Giannozzi P, Cavazzoni C. Large-scale computing with Quantum ESPRESSO. Nuovo Cimento C 2009; 32:49.
25. Bylaska EJ, Glass K, Baxter D, Baden SB, Weare JH. Hard scaling challenges for ab initio molecular dynamics capabilities in NWChem: Using 100,000 CPUs per second. Journal of Physics: Conference Series 2009; 180:012028.
26. Wang LW. PEtot. URL https://hpcrd.lbl.gov/~linwang/PEtot/PEtot.html.
27. Kent PRC. Computational challenges of large-scale long-time first-principles molecular dynamics. Journal of Physics: Conference Series 2008; 125:012058, doi:10.1088/1742-6596/125/1/012058.
28. Kresse G, Furthmüller J. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys. Rev. B 1996; 54:11169–11186.
29. Wood DM, Zunger A. A new method for diagonalising large matrices. J. Phys. A: Math. Gen. 1985; 18(9):1343.
30. Payne MC, Teter MP, Allan DC, Arias TA, Joannopoulos JD. Iterative minimization techniques for ab initio total-energy calculations: molecular dynamics and conjugate gradients. Rev. Mod. Phys. Oct 1992; 64(4):1045–1096.
31. Feng YT, Owen DRJ. Conjugate gradient methods for solving the smallest eigenpair of large symmetric eigenvalue problems. Int. J. Num. Meth. in Engineer. 1996; 39(13):2209–2229.
32. Davidson ER. The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices. J. Comp. Phys. 1975; 17:87–94.
33. Stewart G. Communication and matrix computations on large message passing systems. Parallel Computing 1990; 16:27–40.
34. Dongarra J, van de Geijn R, Walker D. Scalability issues affecting the design of a dense linear algebra library. Journal of Parallel and Distributed Computing 1994; 22(3):523–537.
35. Hendrickson BA, Womble DE. The torus-wrap mapping for dense matrix calculations on massively parallel computers. SIAM Journal of Scientific and Statistical Computing 1994; 15(5):1201–1226.
36. Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J, Du Croz J, Greenbaum A, Hammarling S, McKenney A, et al. LAPACK Users' Guide. Third edn., Society for Industrial and Applied Mathematics: Philadelphia, PA, 1999.
37. Blackford LS, Choi J, Cleary A, D'Azevedo E, Demmel J, Dhillon I, Dongarra J, Hammarling S, Henry G, Petitet A, et al. ScaLAPACK Users' Guide. Society for Industrial and Applied Mathematics: Philadelphia, PA, 1997.
38. Thorton WS, Vence N, Harrison R. Introducing the MADNESS numerical framework for petascale computing. Proceedings of the Cray Users Group 2009.


39. Nieplocha J, Palmer B, Tipparaju V, Krishnan M, Trease H, Apra E. Advances, applications and performance of the Global Arrays shared memory programming toolkit. Int. J. High Perform. Comput. Appl. 2006; 20(2):203–231.
40. Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S, Henry G, Heroux M, Kaufman L, Petitet A, Pozo R, et al. An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Soft. 2002; 28(2):135–151.
41. Engineering and Scientific Subroutine Library (ESSL). URL http://www-03.ibm.com/systems/software/essl/.
42. Solomonik E, Demmel J. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. Technical Report UCB/EECS-2011-10, University of California, Berkeley, Feb 2011.
43. Schatz MD, Poulson J, Van de Geijn RA. Scalable universal matrix multiplication algorithms: 2D and 3D variations on a theme. Technical Report, University of Texas at Austin, 2012. URL http://www.cs.utexas.edu/users/flame/pubs/SUMMA2d3dTOMS.pdf.
44. Malony A, Shende S. Performance technology for complex parallel and distributed systems. Distributed and Parallel Systems: From Instruction Parallelism to Cluster Computing 2000; 37–46.
45. Shende S, Malony A, Cuny J, Lindlan K, Beckman P, Karmesin S. Portable profiling and tracing for parallel scientific applications using C++. Proceedings of the 2nd SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT'98), 1998; 134–145.
46. Shende S, Malony A, Ansell-Bell R. Instrumentation and measurement strategies for flexible and portable empirical performance evaluation. Proceedings of the Tools and Techniques for Performance Evaluation Workshop, PDPTA, vol. 3, CSREA, 2001; 1150–1156.
47. Shende S, Malony AD. The TAU parallel performance system. The International Journal of High Performance Computing Applications Summer 2006; 20(2):287–311. URL http://www.cs.uoregon.edu/research/tau.
48. Sosa C, Knudson B. IBM System Blue Gene Solution: Blue Gene/P Application Development. IBM Redbooks, 2009.
49. Foster I. Designing and Building Parallel Programs. Addison Wesley, 1995.
50. Chambers JM, Cleveland WS, Tukey PA, Kleiner B. Graphical Methods for Data Analysis. Wadsworth, 1995.
51. Kumar S, Dozsa G, Almasi G, Heidelberger P, Chen D, Giampapa ME, Blocksome M, Faraj A, Parker J, Ratterman J, et al. The Deep Computing Messaging Framework: generalized scalable message passing on the Blue Gene/P supercomputer. Proceedings of the 22nd Annual International Conference on Supercomputing, ICS '08, ACM: New York, NY, USA, 2008; 94–103, doi:10.1145/1375527.1375544.
52. Tisseur F, Dongarra J. A parallel divide and conquer algorithm for the symmetric eigenvalue problem on distributed memory architectures. SIAM J. Sci. Comput. 1999; 20(6):2223–2236.
53. Sunderland AG. Parallel diagonalization performance on high-performance computers. Parallel Scientific Computing and Optimization, vol. 27, Ciegis R, Henty D, Kagstrom B, Zilinskas J (eds.). Springer, 2009; 57–66.
54. Petschow M, Peise E, Bientinesi P. High-performance solvers for dense Hermitian eigenproblems. SIAM J. Scientific Computing 2013; 35.
55. Gygi F, Draeger EW, Schulz M, de Supinski BR, Gunnels JA, Austel V, Sexton JC, Franchetti F, Kral S, Ueberhuber CW, et al. Large-scale electronic structure calculations of high-Z metals on the BlueGene/L platform. Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC '06, ACM: New York, NY, USA, 2006, doi:10.1145/1188455.1188502. URL http://doi.acm.org/10.1145/1188455.1188502.
56. Kleis J, Greeley J, Romero N, Morozov V, Falsig H, Larsen A, Lu J, Mortensen J, Dułak M, Thygesen K, et al. Finite size effects in chemical bonding: From small clusters to solids. Catalysis Letters 2011; 141(8):1067–1071.
57. Li L, Larsen AH, Romero NA, Morozov VA, Glinsvad C, Abild-Pedersen F, Greeley J, Jacobsen KW, Nørskov JK. Investigation of catalytic finite-size-effects of platinum metal clusters. Journal of Physical Chemistry Letters 2013; 4(1):222–226.
