
An Efficient Parallel Algorithm for the Calculation of Canonical MP2 Energies

JON BAKER, PETER PULAY
Parallel Quantum Solutions, 2013 Green Acres Road, Suite A, Fayetteville, Arkansas 72703

Department of Chemistry, University of Arkansas, Fayetteville, Arkansas 72701

Received 17 September 2001; Accepted 13 December 2001
Published online 00 Month 2002 in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/jcc.10071

Abstract: We present the parallel version of a previous serial algorithm for the efficient calculation of canonical MP2 energies (Pulay, P.; Saebo, S.; Wolinski, K. Chem Phys Lett 2001, 344, 543). It is based on the Saebo–Almlöf direct-integral transformation, coupled with an efficient prescreening of the AO integrals. The parallel algorithm avoids synchronization delays by spawning a second set of slaves during the bin sort prior to the second half-transformation. Results are presented for systems with up to 2000 basis functions. MP2 energies for molecules with 400–500 basis functions can be routinely calculated to microhartree accuracy on a small number of processors (6–8) in a matter of minutes with modern PC-based parallel computers.

© 2002 Wiley Periodicals, Inc. J Comput Chem 23: 1150–1156, 2002

Key words: canonical MP2 energies; parallel algorithm; Saebo–Almlöf integral transformation

Introduction

One of the most important events in the field of computational chemistry in recent years has been the development of the mass-market personal computer (PC) and the combination of individual PCs into large, multiprocessor systems, commonly known as Beowulf or Linux clusters (after the freely available Unix-like operating system that is usually installed on such clusters).1 Both CPU and I/O speed in modern PCs are close or equal to those in much more expensive workstations or mainframes, and both are continually improving. Linux clusters are currently the most cost-effective means of doing a wide range of quantum chemistry computations.

In this article we report a parallel implementation of our recent serial algorithm for calculating canonical second-order Møller–Plesset perturbation theory (MP2) energies.2 Prior to the advent of Density Functional Theory (DFT),3,4 MP2 theory5 was the simplest and least expensive way of incorporating electron correlation in ab initio electronic structure calculations. It still has certain advantages over DFT, for example, for hydrogen-bonded systems, or when dispersion forces are important.6 Single-point MP2 energies are very useful for checking DFT barriers and relative energies, agreement providing a good indication of the reliability of the DFT energetics.

The closed-shell MP2 energy can be written as:7

E_{MP2} = \sum_{i \le j} e_{ij} = \sum_{i \le j} (2 - \delta_{ij}) \sum_{a,b} (ai|bj)\,[2(ai|bj) - (bi|aj)] / (\varepsilon_i + \varepsilon_j - \varepsilon_a - \varepsilon_b)    (1)

where i and j denote doubly occupied molecular orbitals (MOs), a and b denote virtual (unoccupied) MOs, and ε are the corresponding orbital energies. The (ai|bj) are the two-electron repulsion integrals (in the usual Mulliken notation) over molecular orbitals. Virtually all of the computational work in calculating the MP2 energy is associated with the evaluation of the atomic (AO) integrals (μν|λσ), and their transformation into the MO basis (ai|bj).

(ai|bj) = \sum_{\mu,\nu,\lambda,\sigma} C_{\mu a} C_{\nu i} C_{\lambda b} C_{\sigma j}\,(\mu\nu|\lambda\sigma)    (2)

This is typically accomplished via four quarter transformations (i.e., transforming one index at a time), resulting in a steep fifth-order formal scaling with molecular size. Parallelization of the four-index transformation, for MP2 and other correlation methods, has been discussed in several articles.8–16 Most existing algorithms parallelize over one of the transformation indices for each quarter transformation, and consequently, have a cubic scaling memory requirement. This necessitates multiple passes through the integrals, as only a limited number of electrons can be correlated at any one time, diminishing the efficiency of the algorithm.
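
As a concrete (if naively in-core) illustration of eqs. (1) and (2), the sketch below assumes the full AO integral array, the MO coefficients, and the orbital energies are available as NumPy arrays; the names are ours, and the quarter transformations are written as plain einsum contractions rather than the shell-driven loops of a real integral program.

```python
import numpy as np

def canonical_mp2_energy(ao_eri, C, eps, nocc):
    """Closed-shell canonical MP2 energy, eq. (1), after transforming the AO
    integrals to the MO basis via four quarter transformations, eq. (2).
    ao_eri[m, n, l, s] = (mu nu|lambda sigma) in Mulliken notation, C holds
    the MO coefficients column-wise, eps the orbital energies, and nocc is
    the number of correlated occupied orbitals."""
    Cocc, Cvirt = C[:, :nocc], C[:, nocc:]
    # Four quarter transformations, one index at a time: formally O(N^5) work.
    t = np.einsum('mnls,ma->anls', ao_eri, Cvirt)   # mu     -> a
    t = np.einsum('anls,ni->ails', t, Cocc)         # nu     -> i
    t = np.einsum('ails,lb->aibs', t, Cvirt)        # lambda -> b
    mo = np.einsum('aibs,sj->aibj', t, Cocc)        # sigma  -> j, giving (ai|bj)
    e_occ, e_virt = eps[:nocc], eps[nocc:]
    emp2 = 0.0
    for i in range(nocc):
        for j in range(i, nocc):                    # i <= j with a (2 - delta_ij) factor
            K = mo[:, i, :, j]                      # K[a, b] = (ai|bj)
            denom = e_occ[i] + e_occ[j] - e_virt[:, None] - e_virt[None, :]
            factor = 2.0 if i != j else 1.0
            emp2 += factor * np.sum(K * (2.0 * K - K.T) / denom)
    return emp2
```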

In contrast to parallel algorithms that aim at scaling to a very large number of processors, our code was primarily developed for

Correspondence to: J. Baker; e-mail: [email protected]

Contract/grant sponsor: Air Force Office for Scientific Research; contract/grant number: F49620-00-1-0281

Contract/grant sponsor: National Science Foundation (SBIR); contract/grant number: DMI-9901737 (to Parallel Quantum Solutions)

© 2002 Wiley Periodicals, Inc.

small and medium (4–64 processor) Linux clusters, complementing our parallel SCF and DFT codes. Scaling to massively parallel systems, although desirable as a research project, is in our opinion not fully practicable at this time. Most large computing installations have a number of demanding jobs running simultaneously, and assigning a large number of processors to a single job requires that the other jobs wait in queue, negating the advantages of parallelism.

We use the Parallel Virtual Machine (PVM) toolkit for parallelization, which is a message-passing protocol developed by Oak Ridge National Laboratory and the University of Tennessee,17 and our code can routinely handle systems with upwards of 100 atoms and 1000 basis functions. Creating a Message Passing Interface (MPI) version is also straightforward.

The Serial Algorithm

The serial MP2 algorithm is based on the Saebo–Almlöf direct-integral transformation,18 together with an efficient prescreening of the AO integrals. The Saebo–Almlöf algorithm is unique in that its fast memory requirement scales only quadratically with the basis size. However, this is at the expense of sacrificing much of the permutational symmetry of the AO integrals, which may be why it has not been widely considered in the past. To our knowledge, apart from the original implementation,18 it has been used only in the "sixfold" algorithm (which reduces to "fourfold" for MP2) of Wong et al.,13 and in our recent implementation.2

In the Saebo–Almlöf method, the AO integrals are transformed via two half-transformations, involving first the two occupied MOs and then the two virtuals. If, for example, the integral (μν|λσ) is considered as the (ν,σ) element of a generalized exchange matrix Xμλ, then the two half-transformations can be formulated as

(\mu i|\lambda j) = Y^{ij}_{\mu\lambda} = \sum_{\nu,\sigma} C_{\nu i}\, X^{\mu\lambda}_{\nu\sigma}\, C_{\sigma j} = C_i^{\dagger} X^{\mu\lambda} C_j    (3a)

(ai|bj) = Z^{ij}_{ab} = \sum_{\mu,\lambda} C_{\mu a}\, Y^{ij}_{\mu\lambda}\, C_{\lambda b} = C_a^{\dagger} Y^{ij} C_b    (3b)

The disadvantage of this approach is that the eightfold permutational symmetry of the AO integrals cannot be fully utilized, increasing the integral computation burden fourfold. However, the increased efficiency of the matrix formulation more than compensates for this for larger calculations. As each matrix Yμλ is formed, its nonzero elements are written to disk in compressed format (originally as four-byte integers; see later). For the second half-transformation, the Yμλ (which contain all indices i, j for a given μ,λ pair) have to be reordered into Yij (which contain all indices μ,λ for a given i,j pair). This is accomplished via a standard Yoshimine bin sort.19 The sorted bin files are then read back for each i,j pair, transformed via eq. (3b), and each pair's contribution to the correlation energy, eij, is computed and summed. In most cases, core orbitals need not be correlated because their correlation energy is largely constant. All calculations reported in this article correlate only the valence orbitals.
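
The following sketch restates eqs. (3a) and (3b) in NumPy, keeping everything in core for clarity; in the actual program the Xμλ matrices are assembled from shell batches of screened integrals, and the Yμλ matrices are compressed and written to disk rather than held in a single array. The names and the in-memory "reordering" step are illustrative only.

```python
import numpy as np

def mp2_two_half_transformations(ao_eri, C, eps, nocc):
    """Schematic Saebo-Almlof transformation: eq. (3a) per (mu, lambda) pair,
    an in-core regrouping standing in for the Yoshimine bin sort, then
    eq. (3b) and the pair energies e_ij.  ao_eri[m, n, l, s] = (mn|ls)."""
    nbf = C.shape[0]
    Cocc, Cvirt = C[:, :nocc], C[:, nocc:]
    e_occ, e_virt = eps[:nocc], eps[nocc:]

    # First half-transformation, eq. (3a): only an O(N^2) exchange matrix
    # X[nu, sigma] = (mu nu|lambda sigma) is needed at a time.
    Y = np.empty((nbf, nbf, nocc, nocc))
    for mu in range(nbf):
        for lam in range(nbf):
            X = ao_eri[mu, :, lam, :]
            Y[mu, lam] = Cocc.T @ X @ Cocc          # Y[mu, lam, i, j] = (mu i|lam j)

    # Second half-transformation, eq. (3b), one (i, j) pair at a time.
    emp2 = 0.0
    for i in range(nocc):
        for j in range(i, nocc):
            Yij = Y[:, :, i, j]                     # all (mu, lambda) for this pair
            Z = Cvirt.T @ Yij @ Cvirt               # Z[a, b] = (ai|bj)
            denom = e_occ[i] + e_occ[j] - e_virt[:, None] - e_virt[None, :]
            factor = 2.0 if i != j else 1.0
            emp2 += factor * np.sum(Z * (2.0 * Z - Z.T) / denom)
    return emp2
```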

The entire scheme is straightforward to implement, provided there is enough memory to store a (potentially) complete Xμλ exchange matrix, and enough disk space to hold all possible (compressed) Yμλ matrices. Although the memory demand for the first half-transformation is only O(N²) in principle (where N is the number of basis functions), efficiency demands that AO integrals are calculated in batches over whole shells. Ideally, this requires s²N² double-words of fast memory, where s is the maximum shell size. With 1-GB memory this translates to a maximum of around 1200 basis functions if the basis contains (spherical) g-functions, 1500 for f-functions, and 2200 for d-functions if paging is to be avoided. We have found that the above limits can be exceeded without significant paging penalty if the matrices Xμλ are stored in the proper order. Nevertheless, the storage of the set of Xμλ matrices is a potential bottleneck for large basis sets containing high angular momentum functions. The peak memory demand could be diminished with minimum computational penalty by calculating only blocks (submatrices) of the AO integral matrices Xμλ, and making multiple passes over these blocks. This has not yet been implemented.

A key to the efficiency of our MP2 program is prescreening of the AO integrals (based on the Schwarz inequality20 and discussed in detail in ref. 2), and the compacting of the AO exchange matrices Xμλ, which allows the use of highly efficient dense matrix multiplication routines. We compute only those AO integrals that make a contribution above an appropriate threshold (default 10⁻⁹) to the pair correlation coefficients. Because of the efficiency of the prescreening, the second half-transformation is often as important computationally as the first in our program. Symmetry can easily be utilized in the first half-transformation (this is more difficult in the second) by calculating only those integrals (μν|λσ) which have symmetry-unique shell pairs M,L where μ∈M, λ∈L.
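
A minimal sketch of the Schwarz bound at the basis-function-pair level is given below; the production code screens whole shell quartets and, as described in ref. 2, folds the magnitudes of the occupied MO coefficients into the estimate of each integral's contribution, so this is only the skeleton of the test.

```python
import numpy as np

def schwarz_factors(ao_eri):
    """Q[mu, nu] = sqrt((mu nu|mu nu)): the Schwarz factor for every
    basis-function pair (computed shell-wise in the real program)."""
    nbf = ao_eri.shape[0]
    Q = np.empty((nbf, nbf))
    for m in range(nbf):
        for n in range(nbf):
            Q[m, n] = np.sqrt(abs(ao_eri[m, n, m, n]))
    return Q

def quartet_survives(Q, mu, nu, lam, sig, thresh=1.0e-9):
    """Compute (mu nu|lam sig) only if its Schwarz upper bound
    |(mu nu|lam sig)| <= Q[mu, nu] * Q[lam, sig] exceeds the threshold."""
    return Q[mu, nu] * Q[lam, sig] >= thresh
```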

The serial algorithm2 consists of three main parts. The first part involves a loop over the symmetry-unique shell pairs M,L, calculating all contributing integrals (μν|λσ) with μ∈M, λ∈L, storing them in the matrices Xμλ, transforming to Yμλ using the occupied block of the MO coefficient matrix, and writing the Yμλ matrices to disk. The second part comprises the Yoshimine bin sort, in which the Yμλ are read back from disk, sorted into bins for each i,j pair, and the bins written to disk as they are filled. The third and final part involves reading each bin belonging to a given i,j pair, forming the corresponding Yij matrix, transforming to Zij using the virtual block of the MO coefficient matrix, and forming and summing the pair correlation energies, eij.
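
A toy, in-memory rendering of the Yoshimine sort used in the second part is sketched below: records arrive in (μ,λ) order from the first half-transformation and are regrouped by (i,j) through fixed-size bins. Disk files are stood in for by Python lists, and the record layout is invented for illustration.

```python
from collections import defaultdict

def yoshimine_bin_sort(records, bin_size=1024):
    """Toy Yoshimine bin sort.  'records' is an iterable of
    (i, j, mu, lam, value) tuples produced in (mu, lam) order; the result
    groups them by (i, j) so the second half-transformation can read one
    Yij at a time.  Real bins are fixed-size buffers that are written to
    scratch files when full, not in-memory lists."""
    open_bins = defaultdict(list)       # one partially filled bin per (i, j)
    sorted_store = defaultdict(list)    # stands in for the per-pair bin files
    for i, j, mu, lam, value in records:
        open_bins[(i, j)].append((mu, lam, value))
        if len(open_bins[(i, j)]) == bin_size:      # bin full: "write to disk"
            sorted_store[(i, j)].extend(open_bins.pop((i, j)))
    for pair, leftover in open_bins.items():        # flush partial bins at the end
        sorted_store[pair].extend(leftover)
    return sorted_store
```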

The Parallel Algorithm

Using message passing, parallelization of the first part of the serial algorithm is straightforward; one simply loops over the M,L shells, sending each shell pair to an appropriate slave. Assuming there are nslv slaves, the first nslv shell pairs are sent to each slave in turn; thereafter, all subsequent shell pairs are sent to whichever slave becomes available next, i.e., has finished its current shell pair and has requested more work from the master. At the end of the first half-transformation, each slave node will contain one or more half-transformed files containing compressed Yμλ matrices for whichever μ,λ pairs were computed on that slave. Although the long-standing 2-GB file size limit has recently been lifted in most Unix operating systems, including Linux, we still retain this limit for compatibility; any half-transformed integral file that approaches 2 GB in size is closed, and a new file is opened.
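
The dynamic, first-come-first-served distribution of shell pairs can be pictured with the small sketch below, where a thread-safe work queue stands in for the PVM master–slave message exchange and process_pair is a placeholder for "compute the integrals for (M,L), apply eq. (3a), and write the resulting Yμλ to local disk".

```python
import queue
import threading

def first_half_transformation(shell_pairs, nslv, process_pair):
    """Dynamic load balancing of the first half-transformation: a slave that
    finishes its current (M, L) shell pair simply asks for the next one, so
    faster nodes automatically do more work.  The queue stands in for the
    master's message loop."""
    work = queue.Queue()
    for pair in shell_pairs:
        work.put(pair)

    def slave(rank):
        while True:
            try:
                M, L = work.get_nowait()     # "request more work from the master"
            except queue.Empty:
                return                       # no shell pairs left: this slave is done
            process_pair(rank, M, L)         # integrals + eq. (3a) + write Y to disk

    slaves = [threading.Thread(target=slave, args=(r,)) for r in range(nslv)]
    for t in slaves:
        t.start()
    for t in slaves:
        t.join()
```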

The second half-transformation is done in essentially the same way, i.e., divide the i,j pairs equally among the slaves and transform each Yij matrix on the slave it is assigned to. Unfortunately, each Yμλ matrix contains all i,j pairs, and so a straightforward bin sort on each slave separately generates all possible i,j pairs on every slave. Consequently, we have modified the bin sort part of the algorithm. Our parallel bin sort starts by spawning a second process on each existing slave node, a "bin write" (or "bin listen") process. Whenever, during the bin sort, a particular bin on a given slave for a given i,j pair is full, instead of writing it to disk on the same slave, it is sent to the "bin write" process running on the slave the i,j pair is assigned to. The sort process knows in advance which i,j pairs should be sent to which slaves. The "bin write" process on the appropriate slave then writes the bin to its own local disk. The main advantage of this method is that it is asynchronous (or nonblocking): the send process (both in PVM and MPI) returns, and the sort is resumed as soon as the bin is on its way to the appropriate slave. At the end of the sort the "bin write" processes are killed, and each slave will have one or more sorted half-transformed integral files containing all μ,λ pairs for a subset of the i,j pairs.
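
The essential bookkeeping of the parallel sort is the static, globally known map from (i,j) pairs to slaves, plus the rule that a full bin is shipped to its owner's "bin write" process instead of being written locally. A schematic of that routing, with an assumed round-robin assignment and a generic send_async placeholder for the nonblocking send, is:

```python
def pair_owner(i, j, nocc, nslv):
    """Static (i, j) -> slave map, known in advance by every sort process so
    full bins can be routed without negotiation.  Round-robin over the
    compound pair index is assumed here; the paper only requires that the
    assignment divide the pairs (roughly) equally among the slaves."""
    pair_index = i * nocc + j
    return pair_index % nslv

def route_full_bin(i, j, bin_data, nocc, nslv, send_async):
    """When a bin for pair (i, j) fills on any slave, ship it to the 'bin
    write' process on the owning slave.  send_async stands in for a
    nonblocking PVM/MPI send, so sorting resumes as soon as the bin is
    on its way."""
    dest = pair_owner(i, j, nocc, nslv)
    send_async(dest, (i, j, bin_data))
```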

A disadvantage of doing the parallel bin sort in this way is that communication is not overlapped with floating-point computation. In principle, the sort could be handled during the first half-transformation (instead of after), utilizing the computational resources better. However, this would require additional memory, which is a bottleneck in many calculations, and so we have postponed implementation until our blocking first half-transformation (discussed above), with its reduced memory demand, is available.

The final half-transformation is done on each slave, and involves only the subset of the total number of i,j pairs that are on that slave. Each slave then computes a partial pair-correlation energy sum, and the partial sums are sent back to the master for the final summation to give the full MP2 correlation energy.

As previously noted during the discussion of the serial algorithm, there must be enough space on disk to store all the Yμλ matrices (the nonzero half-transformed integrals). In this regard, the parallel algorithm has a considerable advantage, as the combined disk space available to it is the sum of the disk space on all the nodes. Disk space is needed not just for the Yμλ matrices but also for the Yij matrices during the bin sort. It is possible to do the bin sort and the second half-transformation in multiple passes; indeed, the original serial algorithm allowed only one bin sort file and did only as many i,j pairs in each pass as would fit on one 2-GB file. This limitation has been lifted in the parallel algorithm, which does multiple passes only if there is insufficient disk space in total to hold all the bin sort files in one pass. After each half-transformed integral file is read and sorted, it can be deleted, thus freeing up more disk space for the bin sort files.
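
The pass count for the bin sort follows from simple disk arithmetic; a hedged sketch of that bookkeeping (with invented variable names, and ignoring details such as the per-file 2-GB cap and the space recovered as half-transformed files are deleted) might look like:

```python
import math

def bin_sort_passes(n_pairs, bytes_per_pair, free_scratch_per_node, nslv):
    """Rough estimate of how many passes the bin sort / second
    half-transformation needs: each pass handles as many (i, j) pairs as
    fit in the scratch space left over after the half-transformed integral
    files, summed over all nodes.  Illustrative arithmetic only."""
    total_free = free_scratch_per_node * nslv
    pairs_per_pass = max(1, int(total_free // bytes_per_pair))
    return math.ceil(n_pairs / pairs_per_pass)
```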

A further change from the original serial algorithm2 is the integral compression scheme. Originally, the integral value was divided by the neglect threshold and the resulting number was stored (and written to disk) as a four-byte integer. On reading back from disk, the stored integer was multiplied by the threshold to restore the original integral (less threshold truncation error). This effectively reduces integral storage by half compared to a full eight-byte real. However, this simple scheme limits somewhat the integral neglect threshold, which cannot take values much less than 10⁻⁹ (the default) without risking integer overflow (the largest integer that can be stored in a four-byte word is 2³¹, allowing one bit for the sign). Although the default threshold is perfectly adequate for many molecules using standard basis sets, under certain circumstances (e.g., basis sets containing diffuse functions) the threshold should be tightened to ensure reliable final MP2 energies.

One possibility would be to flag all integrals that are too large to store in four bytes, and store these separately in full precision as reals. However, this would complicate the code, and it is difficult to estimate in advance how many errant integrals there are likely to be. If there are too many, and they are not handled properly, the overall efficiency of the algorithm would suffer. Our solution is to allow an extra byte for integral storage, effectively mimicking a five-byte integer. The extra byte stores the number of times 2³¹ exactly divides the integral (in integer form), with the remainder being stored in four bytes. This slightly increases the integral packing/unpacking time, and increases disk storage by 25% during the first half-transformation, but allows tightening of the integral threshold, if needed, by over two-and-a-half orders of magnitude to around 5 × 10⁻¹².
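
The five-byte scheme can be summarized as follows (sign handling and byte layout here are our own simplification, not the program's exact format): the scaled integral int(|value|/threshold) is split into a quotient by 2³¹, which must fit in the extra byte, and a four-byte remainder. With roughly 255 × 2³¹ representable units instead of 2³¹, the threshold can indeed be lowered by about a factor of 256, i.e., the quoted two-and-a-half orders of magnitude.

```python
BASE = 2**31                      # range of the original four-byte word (one bit for sign)

def pack_integral(value, thresh=1.0e-9):
    """Five-byte compression sketch: scale by the neglect threshold, then
    store the quotient by 2**31 in the extra byte and the remainder in the
    original four bytes.  Returns (sign, overflow_byte, low_word)."""
    scaled = int(round(abs(value) / thresh))
    overflow, low = divmod(scaled, BASE)
    if overflow > 255:
        raise OverflowError("integral too large for the chosen threshold")
    return (-1 if value < 0.0 else 1), overflow, low

def unpack_integral(sign, overflow, low, thresh=1.0e-9):
    """Invert pack_integral, up to the threshold truncation error."""
    return sign * (overflow * BASE + low) * thresh
```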

A useful criterion to indicate potential threshold problems is to monitor the lowest eigenvalue of the overlap matrix (which is usually calculated during the SCF procedure). If this is too small, the threshold should be tightened. (Very small eigenvalues indicate near linear dependency in the basis.) Our new default is to take the threshold to be the minimum of the square of the lowest eigenvalue (in atomic units), or 10⁻⁹ (the original default), with an absolute minimum of 10⁻¹¹; a warning message is printed if the lowest eigenvalue is less than 3.2 × 10⁻⁶. This should suffice to give microhartree accuracy for all but the worst cases.
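
That default can be written down directly; a small sketch of the rule exactly as stated in the text:

```python
def default_mp2_threshold(lowest_overlap_eigenvalue):
    """Integral neglect threshold as described in the text: the minimum of
    the squared lowest overlap eigenvalue (atomic units) and 1e-9, bounded
    below by 1e-11, with a warning for near linear dependence."""
    if lowest_overlap_eigenvalue < 3.2e-6:
        print("Warning: lowest overlap eigenvalue %.2e suggests near "
              "linear dependence in the basis set" % lowest_overlap_eigenvalue)
    return max(min(lowest_overlap_eigenvalue**2, 1.0e-9), 1.0e-11)
```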

The pseudocode for the parallel algorithm is shown in Figure 1. Note that, for those systems for which the available fast memory per node is the determining factor, the maximum system size that can be handled by the parallel algorithm is no greater than for the serial algorithm. However, in practice, the parallel algorithm can handle larger systems than the serial as quite often the determining factor is the amount of available disk space, which is typically much greater for the parallel code.

Examples and Discussion

All calculations have been performed on QuantumStation™ and QuantumCluster™ computers.21 These are small (typically 4–16 processors) and intermediate (32–128 processors) Linux clusters constructed from PC-based hardware, integrated with the modern PQS quantum chemistry program. All major program functionality runs efficiently in parallel.

Tables 1 and 2 present timings for single-point MP2 energies on systems containing up to 120 atoms and 1000 basis functions. These are fairly typical examples of the type of system that can be handled routinely with our MP2 algorithm. The integral neglect threshold during the first half-transformation was based on the lowest eigenvalue of the overlap matrix (see above); in all cases but one this was 10⁻⁹. The exception was α-pinene, which used a threshold of 1.2 × 10⁻¹¹; using 10⁻⁹ resulted in an error in the energy of over 2 millihartree. (Calculations were repeated for several of the other molecules with a tighter threshold; in all cases tried, the final MP2 energy differed by at most a few microhartree.) All occupied MOs with orbital energies less than -3.0 hartree were considered as core orbitals and were not correlated. The total number of basis functions, the number of orbitals correlated and the number of virtual orbitals, together with the SCF and MP2 energies, are shown in Table 1. Calculations were performed on four and six nodes, respectively, of a QS12-1200R QuantumStation,21 which uses 1.2-GHz Athlon processors, and has 1-GB PC2100 DDR ECC memory and 30-GB scratch storage per node. These calculations were sufficiently small so as not to tax either the available memory or the disk storage, and the bin sort and second half-transformation were done in a single pass. Elapsed times for the SCF step, the first half-transformation (including integral calculation), the bin sort, the second half-transformation, and the total MP2 time are given in Table 2.

MP2 calculations with a small number of correlated orbitals relative to the total number of basis functions (e.g., large basis sets with multiple polarization functions) are dominated by the first half-transformation (often by the time needed to compute the atomic integrals); as this ratio increases the second half-transformation becomes dominant. In the former case, the total time to compute the MP2 energy is often around the same or even less than for the corresponding SCF (e.g., α-pinene, Si17H20; Table 2). (Note that all SCF energies were fully direct, with integrals recomputed as needed at each SCF cycle; our new semidirect algorithm should reduce total SCF times for these systems by 50% or more.)

Parallel efficiencies for the MP2 calculations are very good, in most cases as good as or better than for the corresponding SCF (average ratios of the total SCF and MP2 elapsed times, tSCF and tMP2, on four and six processors, relative to the corresponding single-processor times are 3.72 and 5.29 for the SCF and 3.82 and 5.59 for MP2, respectively). The part that usually scales worst is, not surprisingly, the parallel bin sort, which involves an additional spawned process and is bound by input/output and communication. For very small systems [e.g., (water)10, aspirin], the sort actually takes longer on six nodes than it does on four. As the system size increases, so does the efficiency of the bin sort. For large systems with relatively small basis sets (i.e., those systems that are dominated by the second half-transformation) the parallel bin sort is far

Figure 1. Pseudocode for the parallel MP2 algorithm.

Table 1. SCF and MP2 Energies for Some Representative Systems.

Molecule       Atoms  Symm  Basis(a)         nbf(b)  norb(c)  nvirt(d)  ESCF          EMP2
(Water)10      30     C1    6-31G*           190     40       140       -760.093752   -761.984558
(Water)20      60     C1    6-31G*           380     80       280       -1520.137789  -1523.933927
(Water)30      90     C1    6-31G*           570     120      420       -2280.059493  -2285.770180
(Water)40      120    C1    6-31G*           760     160      560       -3040.156576  -3047.772785
Aspirin        21     C1    6-311G**         282     34       235       -645.079387   -647.097033
Porphine       38     D2h   6-31G**          430     57       349       -983.212937   -986.544309
Yohimbine      52     C1    6-31G**          520     69       425       -1143.868395  -1147.611190
α-Pinene       26     C1    6-311G(3df,3p)   542     28       504       -388.066943   -389.757430
(Glycine)10    73     C1    6-31G*           638     114      483       -2144.056964  -2150.161078
Calix1(e)      68     C2h   cc-pVDZ          664     92       536       -1529.889518  -1534.912087
Si17H20        37     C2v   cc-pVDZ          858     44       729       -4923.760850  -4925.810073
Cadion         40     C1    cc-pVDZ          976     64       886       -1166.282816  -1170.768119

(a) Basis sets: 6-31G,24 6-311G25 — standard Pople split-valence basis; cc-pVDZ26 — correlation-consistent polarized valence double zeta.
(b) Number of basis functions.
(c) Number of orbitals correlated.
(d) Number of virtual orbitals.
(e) Tetramethoxy-calix[4]arene (C32H32O4), up-up-down-down conformer.22


more efficient than the serial sort, and actually shows superlinear scaling, i.e., speed-ups greater than the number of processors [e.g., the larger water clusters and (glycine)10]. The reason for this is not clear; it is probably due to the asynchronous nature of the bin sort and to the fact that in the serial algorithm, all the half-transformed and bin sort files are on the same disk, and the individual read/

Table 2. Elapsed Times (in Minutes) for Various Steps in the Calculation of Canonical MP2 Energies for the Systems Shown in Table 1 on 1, 4, and 6 Processors of a QS12-1200R QuantumStation (using 1.2-GHz Athlon CPUs and 1 GB Fast Memory).

Single-processor timings:
Molecule       tSCF    t1(a)   tsort(b)  t2(c)   tMP2(d)
(Water)10      2.00    2.30    0.14      0.37    2.81
(Water)20      12.4    18.9    7.5       18.7    45.1
(Water)30      34.0    66.4    99.6      133.4   299.4
(Water)40      69.9    174.2   337.3     527.4   1039
Aspirin        19.1    17.0    0.16      0.97    18.2
Porphine       12.3    13.9    0.32      7.6     21.9
Yohimbine      75.7    86.2    9.5       26.4    122.2
α-Pinene(e)    309.9   323.9   0.74      5.4     330.0
(Glycine)10    70.2    91.0    94.1      132.2   387.9
Calix1         93.5    92.1    7.7       73.7    173.5
Si17H20        377.7   331.0   1.9       39.9    372.7
Cadion         771.9   730.6   51.5      148.3   930.3

4-processor timings:
Molecule       tSCF    t1(a)   tsort(b)  t2(c)   tMP2(d)
(Water)10      0.69    0.57    0.50      0.09    1.18
(Water)20      3.5     4.7     3.2       4.7     12.4
(Water)30      9.8     16.9    14.5      48.3    80.8
(Water)40      20.3    41.4    42.7      191.3   279.8
Aspirin        5.1     4.3     0.33      0.22    4.8
Porphine       3.5     3.4     0.39      1.9     5.8
Yohimbine      19.6    21.2    4.6       7.9     34.0
α-Pinene(e)    77.6    76.9    1.3       1.2     79.5
(Glycine)10    19.2    21.5    10.4      42.4    75.7
Calix1         24.8    23.1    3.6       20.1    47.6
Si17H20        96.6    80.2    2.2       10.0    92.8
Cadion         193.1   172.9   13.9      41.9    229.5

6-processor timings:
Molecule       tSCF    t1(a)   tsort(b)  t2(c)   tMP2(d)
(Water)10      0.46    0.39    0.69      0.06    1.15
(Water)20      2.6     3.2     2.8       2.4     8.5
(Water)30      7.3     11.2    11.3      31.4    54.7
(Water)40      15.0    27.6    32.4      125.9   189.1
Aspirin        3.5     2.8     0.41      0.16    3.4
Porphine       2.6     2.3     0.52      1.3     4.2
Yohimbine      13.5    14.1    3.4       4.9     22.6
α-Pinene(e)    52.0    51.6    1.7       0.9     54.3
(Glycine)10    13.7    14.3    8.3       27.5    51.5
Calix1         17.4    15.2    2.7       12.7    31.5
Si17H20        66.0    52.9    1.8       6.7     61.8
Cadion         130.1   114.6   10.3      27.1    152.8

(a) Elapsed time for first half-transformation.
(b) Elapsed time for parallel bin sort.
(c) Elapsed time for second half-transformation.
(d) Total MP2 time.
(e) Integral neglect threshold for α-pinene 1.2 × 10⁻¹¹; all other molecules used 10⁻⁹.

Table 3. Results for Chlorophyll a, Taxol, and Tetraphenylporphine on 6 Processors of a QS12-1200R QuantumStation.

Molecule            Chlorophyll a       Taxol              Tetraphenylporphine
Atoms               137                 113                78
Symmetry            C1                  C1                 D2h(i)
Basis               vdzp                6-311G**           6-311G(2df,2pd)
nbf(a)              1266                1422               1860
norb(b)             175                 164                113
nvirt(c)            1025                1196               1699
Max disk (GB)(d)    26                  26                 26
Disk usage(e)       22 GB (11 files)    24 GB (12 files)   4 GB (2 files)
npass(f)            10                  21                 1
tSCF                143                 255                143
t1(g)               179                 411                155
tsort + t2(h)       681                 814                470
tMP2                878                 1250               636
ESCF                -2914.428668        -2912.569429       -1901.925096
EMP2                -2923.570216        -2922.186323       -1909.687507

(a) Number of basis functions.
(b) Number of orbitals correlated.
(c) Number of virtual orbitals.
(d) Maximum allowed scratch file space per node in GB.
(e) Disk usage (number of files) per node for the half-transformed integrals.
(f) Number of passes required for second half-transformation.
(g) Elapsed time (minutes) for first half-transformation.
(h) Elapsed time (minutes) for bin sort + second half-transformation.
(i) All four phenyl groups were perpendicular to the plane of the porphyrin ring.


write heads have to cover a much larger disk area than if the data are spread out over several disks. The parallel bin sort usually takes less time (often considerably less) than either of the two half-transformation steps, so does not unduly impact the overall job time or the parallel efficiency on small numbers of nodes (4–16).

Table 3 presents results for some larger, more demanding, molecules. These were all run on six nodes with identical defaults to the systems in Table 1. All three systems need a large amount of disk storage for the half-transformed integrals; in fact, it is clear that on a typical single PC or workstation with, say, 20-GB scratch disk space, none of these calculations could have been carried out. For example, chlorophyll-a wrote 11 half-transformed integral files (occupying 22 GB of scratch storage) per node (a total of 6 × 22 = 132 GB). The maximum available scratch storage allocated to this job was 26 GB per node (see Table 3), and so there was only 4 GB per node remaining for the bin sort, i.e., a maximum of two bin sort files could be written at any one time. Consequently, the second half-transformation required 10 passes. For taxol, there was only sufficient scratch space left after the first half-transformation to store one bin sort file, and the second half-transformation required 21 passes. Note that, unlike for the more traditional MP2 algorithms (which correlate as many electrons as possible during each pass, and require complete recalculation and transformation of all integrals on every pass), multiple passes during the second half-transformation have only a minor impact on the total job time.

The final example in Table 3, tetraphenylporphine, is the largest calculation (in terms of the number of basis functions) reported in this work. However, because of its high symmetry (D2h), it was far less demanding of computational resources than either chlorophyll-a or taxol; there were only two half-transformed integral files per node (less than 4 GB total disk storage) and the second half-transformation was completed in a single pass. The greater elapsed time for the bin sort and second half-transformation (tsort + t2; Table 3) than for the first half-transformation (t1) reflects the use of symmetry during the first half-transformation (but not during the second). The total elapsed time to compute the MP2 energy, 636 min (just over 10.5 h), was also less than for the two large C1 molecules.

We have tested the scaling of our parallel MP2 algorithm on a larger Linux cluster (a 64-processor QuantumCluster™,21 comprising 32 dual 1-GHz PIII nodes with 512 MB PC133 ECC memory per node). Figure 2 shows the speedup (relative to a two-processor run which, for simplicity, is assigned a value of 2) for an MP2/6-311G** energy calculation on N-formyl pentaalanyl amide (α-helix conformation, C1 symmetry, 672 basis functions). The calculations utilized both processors per node. As can be seen from the figure, the first half-transformation scales very well, with a speed-up of over 46 on 50 processors. The second half-transformation (which includes the parallel bin sort) as expected does not scale as well, but even so the overall scaling for the total MP2 time is a respectable 34 on 50 processors (25 dual-processor nodes).

Figure 3 shows a larger example, tetramethoxy-calix[4]arene (the same example as in Table 1, but with the larger cc-pVTZ basis with 1528 basis functions). In this case the large memory requirement allowed only one slave process per node. The absolute timing on 24 nodes (2.3 h) is significantly better than that reported by Bernholdt et al.22 for canonical MP2 (17 h on 128 nodes of an IBM RS/6000 system, using 120-MHz Power 2 Super CPUs; according to SPEC benchmarks,23 this is about a factor of 2 slower than our 1-GHz PIII), and compares favorably with the approximate RI-MP2 method (4.2 h on 64 nodes without symmetry). Scaling with the number of processors is better in this example (about 22 out of 24 nodes for the total MP2), reflecting the increased weight of computation versus communication for the larger basis set.

Figure 2. Speed-up of the total MP2 calculation and the first half-transformation for N-formyl pentaalanyl amide, C16H28N6O6, 6-311G** basis set, 672 basis functions, no symmetry, two slaves running on each dual-processor node (1-GHz Pentium III). The speed-up for a 2-processor calculation is defined to be 2.

Figure 3. Speed-up of the total MP2 calculation and the first half-transformation for tetramethoxy-calix[4]arene, C32H32O4, cc-pVTZ basis set, 1528 basis functions, C2h symmetry, default threshold, one slave running on each dual-processor node. The speed-up for a 4-processor run is defined to be 4.


Supplementary Material

Full output files (which include molecular geometries) can be downloaded from the PQS website by anonymous FTP from pqs-chem.com (see the MP2 directory).

Acknowledgments

The authors would like to thank Mr. Matthew Shirel for his contributions to the parallel bin sort algorithm in the early phase of this project. We also thank the Department of Computer Science and Engineering, University of Arkansas for time on their 64-processor QuantumCluster™ for scaling tests.

References

1. See, for example, the Web site for the Linux Documentation Project: http://www.linuxdoc.org.
2. Pulay, P.; Saebo, S.; Wolinski, K. Chem Phys Lett 2001, 344, 543.
3. Ziegler, T. Chem Rev 1991, 91, 651, and references therein.
4. Johnson, B. G.; Gill, P. M. W.; Pople, J. A. J Chem Phys 1993, 98, 5612.
5. Pople, J. A.; Binkley, J. S.; Seeger, R. Int J Quant Chem Symp 1976, 10, 1.
6. Kristyan, S.; Pulay, P. Chem Phys Lett 1994, 229, 175.
7. Pulay, P.; Saebo, S.; Meyer, W. J Chem Phys 1984, 81, 1901.
8. Whiteside, R. A.; Binkley, J. S.; Colvin, M. E.; Schaefer, H. F., III. J Chem Phys 1987, 86, 2185.
9. Watts, J. D.; Dupuis, M. J Comput Chem 1988, 9, 158.
10. Limaye, A. C.; Gadre, S. R. J Chem Phys 1994, 100, 1303; Limaye, A. C. J Comput Chem 1997, 18, 552.
11. Marquez, A. M.; Dupuis, M. J Comput Chem 1995, 16, 395.
12. Nielsen, I. M. B.; Seidl, E. T. J Comput Chem 1995, 16, 1301.
13. Wong, A. T.; Harrison, R. J.; Rendell, A. P. Theoret Chim Acta 1996, 93, 317.
14. Schutz, M.; Lindh, R. Theoret Chim Acta 1997, 95, 13.
15. Sosa, C. P.; Ochterski, J.; Carpenter, J.; Frisch, M. J. J Comput Chem 1998, 19, 1053.
16. Fletcher, G. D.; Schmidt, M. W.; Gordon, M. S. Adv Chem Phys 1999, 110, 267.
17. Geist, A.; Beguelin, A.; Dongarra, J.; Jiang, W.; Manchek, R.; Sunderam, V. PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing; MIT Press: Cambridge, 1994.
18. Saebo, S.; Almlöf, J. Chem Phys Lett 1989, 154, 83.
19. Yoshimine, M. J Comput Phys 1973, 11, 333.
20. Abramowitz, M.; Stegun, I. A., Eds. Handbook of Mathematical Functions; Dover: New York, 1972.
21. Parallel Quantum Solutions, 2013 Green Acres Road, Suite A, Fayetteville, AR 72703. http://www.pqs-chem.com.
22. Bernholdt, D. E. Parallel Comput 2000, 26, 945.
23. Standard Performance Evaluation Corporation (http://www.spec.org/cgi-bin/osgresults). The interpolated specfp95 score, which closely corresponds to performance in floating-point-intensive tasks, is 19.3 for the 120-MHz IBM Power 2, while the interpolated score for the 1-GHz PIII is 33.5, i.e., a factor of 1.73 higher.
24. Hariharan, P. C.; Pople, J. A. Theoret Chim Acta 1973, 28, 213.
25. Krishnan, R.; Binkley, J. S.; Seeger, R.; Pople, J. A. J Chem Phys 1980, 72, 650.
26. Dunning, T. H. J Chem Phys 1989, 90, 1007.
