An efficient parallel algorithm for the calculation of unrestricted canonical MP2 energies
DOI: 10.1002/jcc.21924
Jon Baker*[a,b] and Krzysztof Wolinski[a,c]
We present details of our efficient implementation of full accuracy
unrestricted open-shell second-order canonical Møller–Plesset
(MP2) energies, both serial and parallel. The algorithm is based on
our previous restricted closed-shell MP2 code using the Saebo–
Almlof direct integral transformation. Depending on system details,
UMP2 energies take from less than 1.5 to about 3.0 times as long as
a closed-shell RMP2 energy on a similar system using the same
algorithm. Several examples are given, including timings for some
large stable radicals with 90+ atoms and over 3600 basis functions. © 2011 Wiley Periodicals, Inc. J Comput Chem 32: 3304–3312, 2011

Keywords: UMP2 energies · Saebo–Almlof direct integral transformation · parallel algorithm

Introduction
Whenever a new computational algorithm is developed for
the calculation of some theoretical quantity (a new implemen-
tation of an existing method, as is the case in this work, or an
initial implementation of a new quantity), it is nearly always
done first for the closed-shell case. Not only does this provide
considerable simplification in the actual coding, but as the
vast majority of chemical systems are closed-shell, the
resulting algorithm has immediate maximum applicability. Not
infrequently, the unrestricted open-shell version of the new
algorithm is either not developed at all, or done much later.
Although most of the algorithmic difficulties have already
been overcome during the closed-shell development, the
open-shell version—often dismissed as a ‘‘trivial’’ modification
to the closed-shell code—may be far from straightforward in
addition to involving a lot more work.
An unrestricted version of a method is important, not only for
genuine open-shell systems but also for situations where the
formally closed-shell wavefunction is UHF unstable, that is, the
energy can be lowered by breaking the spatial symmetry of the
a and b molecular orbitals. Closed-shell versions of virtually any
wavefunction rarely describe bond-breaking adequately, and
the unrestricted version is usually much better. Even situations
where bonds are stretched, for example in transition states,
may be better described at the unrestricted level, depending on
the degree of spin contamination in the resulting wavefunction.
This article presents some details of the unrestricted open-
shell canonical second-order Møller–Plesset (MP2)
energy module as implemented in the PQS program pack-
age.[1,2] The basic design of the algorithm follows that of
the closed-shell version[3,4] and is again based on the Saebo–
Almlof direct integral transformation.[5] However, we have
modified certain aspects of the algorithm, both serial and par-
allel, particularly the resorting of the initial half-transformed
integrals (the bin-sort step).
The unrestricted MP2 energy can be written as

\[ E_{\mathrm{UMP2}} = E_{aa} + E_{ab} + E_{bb} \tag{1} \]

where

\[ E_{aa} = \sum_{I \le J} e_{IJ} = \sum_{I \le J} \sum_{A > B} \frac{\left[(IA|JB) - (IB|JA)\right]^2}{e_I + e_J - e_A - e_B} \quad \text{(all MOs $a$-spin)} \tag{1a} \]

\[ E_{ab} = \sum_{I,J} e_{IJ} = \sum_{I,J} \sum_{A,B} \frac{(IA|JB)^2}{e_I + e_J - e_A - e_B} \quad \text{($I,A$ $a$-spin; $J,B$ $b$-spin)} \tag{1b} \]

\[ E_{bb} = \sum_{I \le J} e_{IJ} = \sum_{I \le J} \sum_{A > B} \frac{\left[(IA|JB) - (IB|JA)\right]^2}{e_I + e_J - e_A - e_B} \quad \text{(all MOs $b$-spin)} \tag{1c} \]
Here indices I and J represent occupied molecular orbitals
(MOs), A and B represent virtuals, (IA|JB) represents a fully
transformed MO-integral, e are the UHF orbital energies, and
eIJ is a so-called pair energy, which is the contribution to
the UMP2 energy from the occupied orbital pair I,J. This
form of the UMP2 energy requires an integral transformation
from the original AO- to the MO-basis, and in the Saebo–
Almlof method, this is accomplished via two half-transforma-
tions, involving first the two occupied MOs and then the
two virtuals. The AO-integral (μν|λσ) can be regarded as the
(ν,σ) element of a generalized (exchange) matrix X^{μλ}, and
the two half-transformations can be formulated as matrix
multiplications
[a] J. Baker, K. Wolinski
Parallel Quantum Solutions, 2013 Green Acres Road, Suite A, Fayetteville,
Arkansas 72703
E-mail: [email protected]
[b] J. Baker
Department of Chemistry, University of Arkansas, Fayetteville, Arkansas
72701
[c] K. Wolinski
Department of Chemistry, Maria Curie-Sklodowska University, Lublin, Poland
Additional Supporting Information may be found in the online version
of this article.
Journal of Computational Chemistry © 2011 Wiley Periodicals, Inc. 3304
ORIGINAL ARTICLES
ðlIjkJÞ ¼ YlkIJ ¼ P
m;rCTmIX
lkmrCrJ ¼ CT
I XlkCJ
ðAIjBJÞ ¼ ZIJAB ¼
Pl;k
CTlAY
IJlkCkB ¼ CT
AYIJCB
(2)
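The two half-transformations of eq. (2) can be sketched in NumPy as follows. This is an illustrative, unscreened toy (full loops over every μ,λ pair, dense storage); the function and array names are ours, not those of the PQS code:

```python
import numpy as np

def half_transform_mp2(ao_eri, C_occ, C_virt):
    """Toy version of the Saebo-Almlof two-step transformation, eq. (2).

    ao_eri[mu, nu, lam, sig] holds the AO integrals (mu nu | lam sig).
    """
    n = ao_eri.shape[0]
    nocc, nvirt = C_occ.shape[1], C_virt.shape[1]
    # First half-transformation: for each (mu, lam) pair treat the AO
    # integrals as an exchange matrix X and form Y = C_occ^T X C_occ.
    Y = np.empty((n, n, nocc, nocc))
    for mu in range(n):
        for lam in range(n):
            X = ao_eri[mu, :, lam, :]           # the (nu, sigma) block
            Y[mu, lam] = C_occ.T @ X @ C_occ    # (mu I | lam J)
    # Second half-transformation: for each (I, J) pair gather all
    # (mu, lam) elements (the bin-sort reordering) and transform
    # with the virtual MO coefficients.
    Z = np.empty((nocc, nocc, nvirt, nvirt))
    for I in range(nocc):
        for J in range(nocc):
            Yij = Y[:, :, I, J]                 # the (mu, lam) block
            Z[I, J] = C_virt.T @ Yij @ C_virt   # (A I | B J)
    return Z
```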
The disadvantage of this approach—that the full permuta-
tional symmetry of the AO-integrals cannot be utilized—is
more than offset by the computational efficiency of the matrix
formulation, especially for larger systems. A further gain is that
the sparsity of the AO-integral matrix is such that entire rows
and columns are often zero; these can be eliminated, and the
AO-matrix compressed to give smaller, dense matrices that can
use highly efficient dense matrix multiplication routines[3] as
opposed to the usual sparse matrix techniques that are much
less efficient.
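The row/column compression can be sketched as follows. This is illustrative only (the production code screens shell blocks rather than testing a full matrix for zeros), and the names are ours:

```python
import numpy as np

def compress_and_transform(X, C_left, C_right):
    """Drop all-zero rows and columns of the AO exchange matrix X, so a
    smaller dense matrix multiplication gives the same result as the
    full product C_left^T X C_right."""
    rows = np.flatnonzero(np.any(X != 0.0, axis=1))
    cols = np.flatnonzero(np.any(X != 0.0, axis=0))
    if rows.size == 0 or cols.size == 0:
        return np.zeros((C_left.shape[1], C_right.shape[1]))
    Xc = X[np.ix_(rows, cols)]                  # compressed dense block
    # Zero rows/columns contribute nothing, so only the matching
    # coefficient rows are needed.
    return C_left[rows].T @ Xc @ C_right[cols]
```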
As shown in eq. (2), the second half-transformation requires a
reordering of the half-transformed integrals from Y^{μλ}_{IJ} (i.e., all I,J
indices for each μ,λ pair) to Y^{IJ}_{μλ} (i.e., all μ,λ indices for each I,J
pair), and in the closed-shell algorithm this was accomplished
by a modified Yoshimine bin sort.[6] This step has been further
modified in the UMP2 algorithm, as will be described later.
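An in-memory toy version of this regrouping is shown below; a real Yoshimine sort streams the bins to disk in fixed-size records, and the data layout here is purely illustrative:

```python
from collections import defaultdict

def bin_sort(half_transformed):
    """Regroup half-transformed integrals: records keyed by an AO pair
    (mu, lam), each holding a value for every occupied pair (I, J),
    become bins keyed by (I, J) holding (mu, lam, value) triples."""
    bins = defaultdict(list)
    for (mu, lam), values in half_transformed.items():
        for (I, J), y in values.items():
            bins[(I, J)].append((mu, lam, y))
    return dict(bins)
```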
In order for the algorithm to work, there must be enough
disk storage to hold all the nonzero half-transformed inte-
grals (lI|kJ). These are stored as 5-byte integers[2,4] which
reduces the total storage required by 37.5% compared with
a full 8-byte floating point real. Note that, in parallel, the
aggregate disk storage over all nodes is available, and so the
parallel algorithm can handle somewhat larger systems than
the serial algorithm. Memory demands are modest,
although, for reasons of efficiency, the AO integrals are
calculated in batches over complete shells, and this requires
a minimum of s²N² double words (1 double word = 8 bytes)
of fast memory (where N is the number of basis functions
and s is the shell size, i.e., s = 1 for s shells, s = 3 for p
shells, s = 5 for (spherical) d shells, etc.); if this is not
available, then only those integrals that can be successfully
handled in the available memory are actually utilized, and
that batch is recalculated as many times as necessary. In
practice, calculations involving up to around 2000 basis
functions can readily be carried out with only modest memory
demands (say 2 GB RAM per CPU), and no recalculation of
AO integrals unless there are high angular momentum basis
functions in the basis set.
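To illustrate the storage arithmetic only: keeping 5 of a double's 8 bytes gives exactly the quoted 37.5% saving. The actual 5-byte scheme of Refs. [2,4] is an integer-based packing, not the naive byte truncation sketched here:

```python
import struct

def pack5(x):
    """Keep the 5 most significant bytes of the big-endian IEEE-754
    double (sign, exponent, and the top 28 mantissa bits)."""
    return struct.pack('>d', x)[:5]

def unpack5(b):
    """Restore an approximate double by padding with zero bytes."""
    return struct.unpack('>d', b + b'\x00\x00\x00')[0]
```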
The UMP2 Algorithm
The major complication in the open-shell algorithm is that the
presence of two sets of MOs, corresponding to a and b spin,
results in three different sets of final transformed integrals
(aa|aa), (aa|bb), and (bb|bb) (as shown in eqs. (1a)–(1c), respec-
tively) compared with just one in the closed-shell case (where
the spatial parts of the a- and b-spin orbitals are identical).
Additionally, the I = a, J = b and I = b, J = a half-transformed
ab integrals are stored separately. The way this is handled is to
do all three first half-transformations together, as the raw
atomic integrals are computed, giving three half-transformed
integral files at the same time. (The disk storage requirements
are thus more-or-less four times those of a similar closed-shell
calculation, with the ab integral file being twice the size of the
other two.) Once the half-transformed integrals have all been
stored on disk, the second half-transformations are carried out
essentially independently on each file separately.
If there is sufficient storage, then all the half-transformed
integrals from a single file can be read, sorted, and written to
disk before any of them are further transformed. If the half-
transformed integral file is too large to effectively duplicate,
then the second half-transformation can be done in multiple
passes, doing as many I,J indices in each pass as can be suc-
cessfully sorted and stored in the available disk storage. It
turned out that the time penalty for doing multiple passes
compared with a single pass in the original RMP2 algorithm
was negligible for small numbers of passes.[4] Furthermore,
when the algorithm was first developed (the early 2000s), it
was awkward to have files larger than 2 GB on the 32-bit archi-
tecture machines available at that time, and so the second half-
transformation was limited in each pass to a subset of the I,J
pairs for which all sorted integrals could be written to a single
2 GB file. (Actually for a direct access file, it was not the overall
size of the file that counted but the size and total number of
the records written to it.) If necessary (which was often the
case), the first half-transformed integrals were written to multi-
ple files. This file size limitation has of course been lifted for the
foreseeable future on 64-bit architecture machines.
In the UMP2 algorithm, we have added an ‘‘in-core’’ option
which limits the number of I,J pairs in each pass of the second
half-transformation step to those that can be sorted and trans-
formed in the available memory. In the serial algorithm, this
eliminates the bin-sort file altogether and hence requires no
additional disk storage, at the cost of potentially multiple
reads of the half-transformed integral file (which is, however,
read sequentially). This is faster than the standard ‘‘bin-sort’’
algorithm if all the I,J pairs can be processed in a relatively
small number of reads of the integral file, but becomes more
and more inefficient as the system size increases. However, it
does allow calculations on systems that are so large relative to
the available resources that there is little or no disk storage
left after computing and saving all the first half-transformed
integrals. Note that once a specific contribution to the total
MP2 energy has been computed, say Eaa [eq. (1a)], the corre-
sponding integral file is deleted, thus freeing up disk storage
for the bin sort. Consequently, the less efficient ‘‘in-core’’ algo-
rithm should normally be confined to the calculation of one
energy term only, if it is used at all.
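The multiple-pass bookkeeping described above amounts to a simple ceiling division; the helper below is hypothetical (pass planning in the real code is driven by disk as well as memory limits):

```python
def plan_passes(n_pairs, bytes_per_pair, mem_bytes):
    """How many passes over the half-transformed integral file are
    needed if only mem_bytes of memory is available to hold the sorted
    Y^{IJ} matrices for a batch of I,J pairs."""
    pairs_per_pass = max(1, mem_bytes // bytes_per_pair)
    n_passes = -(-n_pairs // pairs_per_pass)  # ceiling division
    return pairs_per_pass, n_passes
```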
Perhaps the biggest modification in the UMP2 algorithm has
been to the bin-sort step. In the original RMP2 algorithm, each
sorted 5-byte integral was stored together with its two AO-
indices (as 2-byte integers), and so each integral in effect
required 9 bytes of storage. This was done primarily because
the magnitude of each integral was checked as it was read
from the half-transformed integral file, and integrals whose
contribution would be below the threshold were discarded.
The two AO-indices are of course common for all I,J pairs in
that record on the half-transformed integral file and so, if no
integrals are discarded, only need to be written once. An array
for the AO indices is now kept and written separately from the
5-byte integrals. The bulk of the very small integrals are elimi-
nated at the AO-level by a highly efficient integral neglect
scheme based on localized MOs,[3] and the number that can
be further neglected from the half-transformed integrals is, rel-
atively speaking, not significant. The additional work involved
in handling those few integrals that could have been
neglected is more than offset by the subsequent simplification
in the code and the considerable reduction in disk storage for
the sorted integrals. Note that the 5-byte integral storage
procedure used originally[4] was subsequently modified so as
to remove all possibility of integer overflow,[2] and it is the
modified procedure that is used here.
The size of each record on the bin-sort file (i.e., the number
of I,J bins written at any one time) is carefully selected to cor-
respond to the number of I,J indices that can be read back
and transformed in the available memory, that is, to the num-
ber of YIJ matrices that can be handled in core at once. In the
parallel algorithm, the I,J pairs are further divided among
the slaves; after the half-transformed integrals on each slave's
local disk have been sorted, they are sent to the slave that has
been assigned to deal with that particular batch of I,J pairs,
which subsequently writes them to a local bin-sort file prior
to transforming them and forming the corresponding pair
energies.
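A toy version of this pair distribution is a simple block assignment of the lower-triangular (I, J) list; the actual PQS scheme is not specified here, so this is an assumption for illustration:

```python
def assign_pairs(n_occ, n_slaves):
    """Block assignment of occupied-orbital pairs (I, J), J <= I, to
    slaves: contiguous chunks of (nearly) equal size."""
    pairs = [(I, J) for I in range(n_occ) for J in range(I + 1)]
    block = -(-len(pairs) // n_slaves)  # ceiling division
    return {s: pairs[s * block:(s + 1) * block] for s in range(n_slaves)}
```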
Results and Discussion
Table 1 presents serial timings (in minutes) for computation of
the MP2 energy (starting from converged MOs from the
preceding SCF step) for a number of formally closed-shell
molecular systems, ranging in size from a few hundred to over
two thousand basis functions. Their structures are shown sche-
matically in Figure 1. In all cases, core orbitals (defined as
those MOs with orbital energy below −3.0 au) were omitted
from the calculation. Both RMP2 and UMP2 calculations were
performed on each system. This enables a direct comparison
of the performance of the closed-shell and open-shell
algorithms, as well as a check on the accuracy of the UMP2
algorithm, that is, is the answer right?
There are, in terms of computing the MP2 energy, two limit-
ing cases: (1) small molecules with (relatively) large basis sets
and (2) large molecules with (relatively) small basis sets. In the
first case (the most common), calculating the MP2 energy is
dominated by computing the AO-integrals and the first half-
transformation; in the second, it is dominated by the bin sort
and the second half-transformation. In the UMP2 algorithm,
because there are three different integral types (and effectively
four times as many half-transformed integrals), then the bin
sort and second half-transformation will take around three to
four times as long as in the RMP2 algorithm (which has only a
single integral type). The first half-transformation step, on the
Table 1. RMP2, UMP2, and SOS-MP2 single-point energy timings (in minutes).[a]

                                          First half-trans.    Second half-trans.
       NBas  NCorr  NVirt  RAM[b]        Tint    Ttran        Tbin   Tsort   Ttran    Total

(H2O)20 MP2/6-31G*
RMP2    320    80    280    320           2.0     1.8          0.4     -      1.8       6.0
UMP2    320    80    280    320           1.8     4.1          1.5    0.6     5.5      13.5
SOS                                       1.8     2.9          0.6    0.2     3.0       8.7

C16H28N6O6 MP2/6-311G**
RMP2    672    79    565    400          21.2    11.2          1.9     -     10.7      45
UMP2    672    79    565    400          19.9    21.7          8.3    3.2    35.1      88
SOS                                      19.9    17.1          3.8    1.3    17.8      60

C20H32N10O11 MP2/6-31G*
RMP2    679   114    524    400           9.4     7.1          7.1     -     18.8      42
UMP2    679   114    524    400           9.4    19.8         16.7   12.6    64.0     123
SOS                                      10.1    13.2          8.5    5.8    31.7      70

C21H26N2O3 (yohimbine) MP2/PC-2
RMP2   1144    69   1049   1200         255     162            6.2     -     50.0     467
UMP2   1144    69   1049   1200         251     260           20.3   10.4   152       696
SOS                                     241     229            9.5    4.3    76.4     561

C32H32O4 (calix[4]arene) MP2/cc-pvtz[c]
RMP2   1528    92   1400   1200         117     140            4.4     -    167       428
UMP2   1528    92   1400   1200         119     249           16.0   20.7   628      1034
SOS                                     120     227            7.3    8.1   318       681

C44H42N4 MP2/PC-2[c]
RMP2   2028   119   1861   3000         440     509           17       -    649      1615
UMP2   2028   119   1861   3000         423     900           84     325   2500      4241
SOS                                     426     793           44     151   1297      2716

Timings reported are as follows (all timings elapsed): first half-trans.: Tint: time to compute raw AO-integrals; Ttran: time to carry out the integral transformation; second half-trans.: Tbin: time to read in the half-transformed integrals, reorder indices and rewrite to the bin-sort file; Tsort: time to read in the reordered integrals in 5-byte format and convert to full double precision; Ttran: time to carry out the integral transformation. A dash indicates a timing not reported separately for the RMP2 runs.
[a] Serial timings on a 2.26 GHz Intel Xeon E5520 quad-core X2 processor. [b] Memory request in MB. [c] C2h symmetry.
other hand, should certainly not take four times longer than it
does in the RMP2 algorithm, because calculating the AO-inte-
grals is a significant component, and this takes the same time
in both cases.
As system size increases even further, the second half-transformation
step will become dominant in essentially all cases,
despite the formally higher scaling of the first quarter-transformation
step (O(nN⁴) compared with O(n²N³) at worst for the
other steps, where n is the number of orbitals correlated and
N is the number of basis functions), because increasing AO-integral
neglect ultimately reduces the formal scaling of the first
step down to O(nN²).
Computing a UMP2 energy should therefore take, at worst,
no more than four times as long as the corresponding RMP2
energy for molecules of type (2), above, and in practice is
more typically a factor of around two, and potentially even
less the closer you are to the type (1) limit. This is borne out
by the timings shown in Table 1. Also shown in Table 1 are
timings for the scaled opposite-spin (SOS)-MP2 energy. This is
a procedure from Head-Gordon and coworkers[7] similar to the
earlier MP2 scaling approach of Grimme[8] who showed that
MP2 energies can be systematically improved by scaling
the opposite-spin and same-spin components of the MP2
energy separately. (The scaling factors determined by Grimme
were 6/5 and 1/3, respectively.)

Figure 1. Structures of the molecules used for the job timings reported in Table 1.

Arguments are advanced in
Ref. [7] for eliminating the smaller same-spin component entirely
and compensating by having a larger scaling factor on the
opposite-spin component. (The suggested scaling factor is 1.3.)
The statistical improvement over the standard MP2 energy
for SOS-MP2[7] is similar to that found by Grimme using his
two-component scaling.[8] However, SOS-MP2 has the advant-
age that the same-spin components do not need to be
computed at all. For a UMP2 energy, this means that only the
second term in eq. (1), the ab cross term given by eq. (1b),
survives leading to potential savings in both the first and, in
particular, the second half-transformation.
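Using the scaling factors quoted above, the two scaled energies reduce to one-line formulas (the component energies would come from eqs. (1a)-(1c); the helper names are ours):

```python
def scs_mp2(e_ab, e_ss):
    """Grimme's spin-component-scaled MP2: scale the opposite-spin
    (ab) part by 6/5 and the same-spin (aa + bb) part by 1/3."""
    return (6.0 / 5.0) * e_ab + (1.0 / 3.0) * e_ss

def sos_mp2(e_ab):
    """SOS-MP2 of Head-Gordon and coworkers: drop the same-spin part
    entirely and scale the opposite-spin part by 1.3."""
    return 1.3 * e_ab
```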
The ratios between the times needed to compute the RMP2
and UMP2 energies for the two largest systems in Table 1 are
greater than they ought to be based on the discussion given
above. (They should be a factor of around 2.0 or less rather
than, e.g., the almost 2.5 for calix[4]arene.) The reason is that
both systems have symmetry (C2h which has four symmetry
operations) which in the current code can be utilized in the
first half-transformation (by evaluating only symmetry-unique
AO integrals) but not in the second. Consequently, in compari-
son with a system with no usable symmetry, the first half-
transformation step only takes about a quarter of the time it
ought to (one divided by the number of symmetry opera-
tions), giving the second half-transformation step much more
weight. For very large systems, the ratio will increase, as the
second half-transformation becomes dominant.
Parallel UMP2 energy timings are given in Table 2 for a
number of long-lived organic radicals, the structures of which
are shown in Figure 2. Triphenylmethyl was the first radical
ever discovered (by Gomberg[9]); TEMPO is widely used as a
radical trap, as a structural probe for biological systems in
conjunction with electron spin resonance (ESR) spectroscopy,
as a reagent in organic synthesis, and as a mediator in con-
trolled free-radical polymerization[10]; DPPH is also widely used
as a radical scavenger; the dodecyl syringate radical has been
proposed as a candidate for ESR molecular quantum com-
puters,[11] and the 6,11-diphenyloxy-5,12-naphthacenequinone
triplet diradical has been postulated as an intermediate in the
photoinduced phenyl ring transfer.[12] The largest system is an
aminyl diradical derived from m-phenylene synthesized by the
Rajca group.[13] The five collinearly fused rings are reported to
be approximately planar with the 4-tert-butylphenyl group
nearly perpendicular to them. As this system (94 atoms, 2020
basis functions) taxes the resources of the cluster we are
using, particularly the disk storage, Cs symmetry was enforced
to reduce the number of half-transformed integrals. All calcula-
tions used the semiempirical PM3[14] optimized geometries. To
give at least some idea of the scaling of the algorithm with
system size, we used the same basis set in all cases, Jensen’s
polarization-consistent PC-2 basis.[15] This is a TZ-type basis set
with 3s2p1d on hydrogen and 4s3p2d1f on first-row elements
and is fairly typical of the size of basis used in serious MP2
calculations.
Table 2. Parallel UMP2/PC-2 single-point energy timings (in minutes)[a] for some long-lived organic radicals, run utilizing one CPU per node.

2,2,6,6-Tetramethylpiperidine-1-oxyl "TEMPO" (C9H18NO), Cs; NAtom 29, NBas 582, NCorr 33(a)/32(b), NVirt 538(a)/539(b), RAM[b] 720
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    4     15.8      0.93    0.18    1.16      18.1
    8      7.9      1.43    0.12    0.59      10.1
   12      5.3      1.77    0.13    0.39       7.6

Triphenylmethyl (CPh3), C3; NAtom 34, NBas 780, NCorr 46(a)/45(b), NVirt 715(a)/716(b), RAM[b] 480
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    4     53.0      2.29    0.54    5.30      61.3
    8     26.6      1.15    0.29    2.67      30.8
   12     17.7      0.79    0.20    1.79      20.5

Diphenylpicrylhydrazyl "DPPH" (C18H12N5O6), C1; NAtom 41, NBas 1038, NCorr 73(a)/72(b), NVirt 936(a)/937(b), RAM[b] 720
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    4    135.7     16.5     6.8    30.8      190.1
    8     67.7      9.4     2.9    15.4       95.7
   12     45.1      6.9     1.5    10.3       63.9

Dodecyl syringate (C21H33O5), C1; NAtom 59, NBas 1242, NCorr 74(a)/73(b), NVirt 1142(a)/1143(b), RAM[b] 1500
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    4    115.1     27.8    14.5    55.2      212.3
    8     57.8     13.1     6.9    27.7      105.8
   12     38.4     10.7     3.0    18.5       70.8

6,11-Diphenyloxy-5,12-naphthacenequinone triplet diradical (C30H18O4), C1; NAtom 52, NBas 1272, NCorr 82(a)/80(b), NVirt 1156(a)/1158(b), RAM[b] 1800
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    4    274.7     31.9    16.2    70.7      394.5
    8    137.5     16.3     8.1    35.4      197.7
   12     91.2     11.7     5.7    23.6      132.7

Triplet aminyl diradical (C42H50N2), Cs; NAtom 94, NBas 2020, NCorr 115(a)/113(b), NVirt 1861(a)/1863(b), RAM[b] 3500
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    4    579.2    109.7    88.1   568.6     1349
    8    289.9     57.2    47.9   284.6      681.8
   12    192.7     40.1    33.2   189.7      459.4

Timings reported are as follows (all timings elapsed): Ttran(1): time to compute raw AO-integrals and carry out the first half-transformation (not reported separately in the parallel algorithm); Tbin: time to read in the half-transformed integrals, reorder indices and rewrite to the bin-sort file; Tsort: time to read in the reordered integrals in 5-byte format and convert to full double precision; Ttran(2): time to carry out the second half-transformation.
[a] Parallel timings on a 3.0 GHz Intel Pentium D930 dual-core processor. [b] Memory request per CPU in MB.
The jobs were run on an at least 5-year-old PC-based cluster
with a single 3.0 GHz Intel Pentium D930 dual-core processor,
4 GB of RAM and 300 GB of striped RAID0 scratch storage per
node. Communication between the nodes was via Gigabit
Ethernet. Despite its age, the clock speed of each processor
on this cluster is comparable with the best modern PCs,
because for the past several years, machines have been get-
ting ‘‘faster’’ overall not by increasing the clock speed of each
individual CPU but by incorporating more and more CPUs
(cores) per processor. At the time of writing, multicore
processors including up to 12 CPUs are readily available. The
configuration of this older cluster is actually advantageous for
MP2 jobs, as modern systems with the same number of CPUs
(we utilized twelve nodes for a total of 24 CPUs) generally have
far fewer nodes, and consequently less resource per CPU, than
older clusters, which have fewer CPUs per node. For example,
Figure 2. Structures of the radicals used for the job timings reported in Tables 2 and 3.
Efficient Algorithm for the Calculation of Unrestricted Canonical MP2 Energies
Journal of Computational Chemistry http://wileyonlinelibrary.com/jcc 3309
running a job over all 12 nodes allows the aggregate disk
storage on each node to be utilized; a more modern system
with all CPUs on a single node would require 12 times the
disk storage per node to maintain the same amount of disk
storage per CPU. The same would apply of course to memory
(RAM).
The situation is even worse for I/O as modern disk drives,
despite their increased capacity, typically still have only a sin-
gle I/O controller per drive. Thus, with the same disk configu-
ration, the amount of I/O resource per CPU decreases in direct
proportion to the increase in the number of CPUs per node.
As MP2 jobs involve a lot of I/O, they are likely to benefit sig-
nificantly when run over many nodes compared with running
over just a few nodes with much less I/O capability per CPU.
The single-point UMP2 energies reported in Table 2 were
run utilizing one of the two available CPUs per node. Table 3
reports the same systems run on the same number of nodes
but using both available CPUs. This doubles the number of
CPUs from a maximum of 12 to a maximum of 24, but halves
the I/O resource per CPU. The effect of the reduced I/O
capacity can be seen by comparing the 8-CPU timings
between the two tables (one ran on eight nodes using one
CPU per node, the other ran on four nodes using two CPUs
per node). Apart from the smallest system (TEMPO, which has
only 582 basis functions), the total job time increases
significantly, by up to 40%, between the one- and two-CPU-per-node
runs. Every step that involves I/O takes longer when both
CPUs are used. Both the first half-transformation time (Ttran,
which involves I/O when the integrals are written to disk), and
particularly the second half-transformation (Tbin and Tsort),
are affected. That the effect is due to I/O only can be seen
from the timing for the integral transformation in the second
half-transformation step (Ttran), which is the timing for the
final transformation only and involves no I/O; this is essentially
the same in both cases.

Table 3. UMP2/PC-2 single-point energy timings (in minutes)[a] for some long-lived organic radicals, run utilizing two CPUs per node.

2,2,6,6-Tetramethylpiperidine-1-oxyl "TEMPO" (C9H18NO), Cs; NAtom 29, NBas 582, NCorr 33(a)/32(b), NVirt 538(a)/539(b), RAM[b] 720
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    8     8.30      0.31    0.12    0.59       9.4
   16     4.15      0.42    0.09    0.30       5.0
   24     2.78      0.52    0.09    0.20       3.6

Triphenylmethyl (CPh3), C3; NAtom 34, NBas 780, NCorr 46(a)/45(b), NVirt 715(a)/716(b), RAM[b] 480
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    8    29.5       3.15    0.48    2.69      36.0
   16    14.7       0.94    0.24    1.35      17.4
   24     9.8       0.75    0.17    0.91      11.8

Diphenylpicrylhydrazyl "DPPH" (C18H12N5O6), C1; NAtom 41, NBas 1038, NCorr 73(a)/72(b), NVirt 936(a)/937(b), RAM[b] 720
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    8    71.6      24.9    13.7    15.5      126.0
   16    35.6      14.2     5.6     7.8       63.5
   24    23.8       9.6     2.3     5.2       41.1

Dodecyl syringate (C21H33O5), C1; NAtom 59, NBas 1242, NCorr 74(a)/73(b), NVirt 1142(a)/1143(b), RAM[b] 1500
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    8    60.9      38.0    21.7    27.9      148.9
   16    30.3      20.5    11.8    14.0       76.9
   24    20.3      14.6     7.9     9.3       52.5

6,11-Diphenyloxy-5,12-naphthacenequinone triplet diradical (C30H18O4), C1; NAtom 52, NBas 1272, NCorr 82(a)/80(b), NVirt 1156(a)/1158(b), RAM[b] 1800
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    8   150.1      45.2    26.2    35.6      258.2
   16    75.2      22.8    12.8    17.8      129.1
   24    49.9      16.4     9.1    11.9       87.7

See comments regarding timing details under Table 2.
[a] Parallel timings on a 3.0 GHz Intel Pentium D930 dual-core processor. [b] Memory request per CPU in MB.

Table 4. UMP2 single-point energy timings (in minutes)[a] for the Finland trityl radical, run on the Star of Arkansas utilizing two CPUs per node.

Finland trityl radical (C40H39O6S12), PC-2 basis set, C3; NAtom 97, NBas 2334, NCorr 154(a)/153(b), NVirt 2074(a)/2075(b), RAM[b] 4000
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
   48   106.5      88.2    81.3    71.8      349.1
   72    71.0      65.9    42.6    47.9      228.6
   96    53.3      54.1    30.9    36.0      174.9

Finland trityl radical (C40H39O6S12), aug-cc-pvtz basis set[c], C3; NAtom 97, NBas 3613, NCorr 154(a)/153(b), NVirt 3186(a)/3187(b), RAM[b] 4000
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
   48  1638       179.6   240.5   260.3     2331
   72  1093       141.4   151.4   173.6     1569
   96   820        87.9    99.8   130.4     1141

See comments regarding timing details under Table 2.
[a] Parallel timings on a 2.65 GHz Xeon E5430 quad-core processor. [b] Memory request per CPU in MB. [c] 167 basis function pairs suppressed due to linearly dependent basis set.
The job time for TEMPO actually decreases with apparently
decreasing I/O resource. Alone among the systems studied,
TEMPO shows an increasing elapsed time for the bin sort
(Tbin) with increasing number of CPUs. This strongly suggests
that communication overhead is the bottleneck here with
insufficient data sent to compensate for this. The decrease in
time when both CPUs per node are used is likely due to the
decrease in internode communication.
Note that the largest job (the aminyl diradical with 2020 ba-
sis functions) could not be run using both CPUs per node, as
it required more than 2 GB memory per CPU to run efficiently.
We actually gave this job 3.5 GB per CPU (see Table 2), far
more than the minimum needed.
As can be seen from the timings reported in both Tables 2
and 3, the parallel efficiency of the UMP2 algorithm, at least up
to 24 CPUs, is very high. Considering Table 2, then with 100%
parallel efficiency the total job time on 8 CPUs should be half,
and that on 12 CPUs should be a third, of the job time reported
on 4 CPUs. This is close to being the case for all jobs except
the smallest (TEMPO). The same thing applies when comparing
the corresponding 16- and 24-CPU timings with the 8-CPU
timings reported in Table 3. Given that the total job times show
a high parallel efficiency, not surprisingly so do the individual
timings for the various job steps. The least parallel-efficient step
is the bin sort (the step that involves the most I/O).
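The parallel efficiencies discussed here follow from a simple ratio; the helper below is our own, not part of PQS:

```python
def parallel_efficiency(t_ref, n_ref, t_n, n_cpu):
    """Parallel efficiency (%) of an n_cpu run relative to a reference
    run on n_ref CPUs: 100% means the job time scales as 1/NCPU."""
    return 100.0 * (t_ref * n_ref) / (t_n * n_cpu)
```

For DPPH in Table 2, for example, this gives an efficiency of roughly 99% on 12 CPUs relative to the 4-CPU run.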
Table 4 reports parallel UMP2 energy timings for the Finland
trityl radical[16] (Fig. 3) run on 48, 72, and 96 CPUs of the Star
of Arkansas, a University-wide resource with 157 dual-proces-
sor nodes, each containing two 2.65 GHz Xeon E5430 quad-
core processors, 16 GB RAM and at least 250 GB scratch stor-
age. The internal network utilizes Myrinet which is faster than
Gigabit Ethernet and has a much lower latency. We used two
different basis sets, the PC-2 basis used for many other calcula-
tions reported in this paper (2334 basis functions) and the
larger aug-cc-pvtz basis (3613 basis functions). In addition to
the parallel scaling, these calculations show the scaling with
increasing basis set size for the same system size.
Using the aug-cc-pvtz basis results in severe linear dependency (the lowest eigenvalue of the overlap matrix was 9.01 × 10^-10), and we suppressed (i.e., eliminated) all basis function combinations with eigenvalues lower than 10^-5. This led to the removal of 167 basis function combinations. In all the other calculations the limit on the lowest eigenvalue of the overlap matrix for basis function suppression was 10^-6 (low enough that all basis functions were kept).

Figure 3. Structure of the Finland trityl radical (C40H39O6S12). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Table 5. SCF and MP2 energies for all the systems reported in this paper.

System                      Basis set    Integral thresh[a]   SCF thresh[b]   E(SCF)          E(MP2)[c]
(H2O)20                     6-31G*       10^-10/10^-9         2 × 10^-6       -1520.1377885   -1523.9339277
C16H28O6N6                  6-311G**     10^-10/10^-9         2 × 10^-6       -1398.5548925   -1403.0710703
C20H32N10O11                6-31G*       10^-10/10^-9         9 × 10^-6       -2144.0670136   -2150.1610763
Yohimbine (C21H26N2O3)      PC-2         10^-10               7 × 10^-7       -1144.2026976   -1448.6982211
Calix[4]arene (C32H32O4)    cc-pvtz      10^-10               6 × 10^-6       -1530.2719323   -1536.3998656
C44H42N4                    PC-2         10^-11               3 × 10^-8       -1908.8828226   -1916.8014598
TEMPO (C9H18NO)             PC-2         10^-10               3 × 10^-7       -480.6978113    -482.6678844
CPh3                        PC-2         10^-10               1 × 10^-7       -728.4669260    -731.2670428
DPPH (C18H12N5O6)           PC-2         10^-10               3 × 10^-6       -1410.0376839   -1415.1649922
Dodecyl syringate           PC-2         10^-10               5 × 10^-7       -1189.1492757   -1193.6913084
C30H18O4[d]                 PC-2         10^-11               1 × 10^-7       -1446.4337022   -1451.6914684
C42H50N2                    PC-2         10^-11               5 × 10^-7       -1729.0447935   -1736.0714234
C40H39O6S12                 PC-2         10^-12/10^-11        3 × 10^-7       -6757.7433943   -6767.4467778
C40H39O6S12[e]              aug-cc-pvtz  10^-14/10^-11        2 × 10^-7       -6757.7741542   -6768.1276175

[a] If two integral thresholds are given, the first is for the SCF, the second for MP2; if only one threshold is given it is the same for both SCF and MP2. [b] The RMS observed at convergence in the commutator of the Fock and the density matrix (known as the Brillouin condition). [c] In all cases, core orbitals (orbital energy < -3.0 au) were omitted from the calculation. [d] MP2 energy potentially less accurate for this system due to a near-linearly dependent basis set in combination with the 5-byte integral packing scheme (the lowest eigenvalue of the overlap matrix was 3.87 × 10^-6). [e] MP2 energy potentially less accurate for this system due to a linearly dependent basis set in combination with the 5-byte integral packing scheme and basis function suppression (the lowest eigenvalue of the overlap matrix was 9.01 × 10^-10; 167 basis function combinations were suppressed).
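The eigenvalue-based suppression described above is essentially canonical orthogonalization: diagonalize the overlap matrix S and discard eigenvectors whose eigenvalues fall below the threshold. The following is an illustrative sketch only (the function name and the tiny two-function example are our own, not the PQS implementation):

```python
import numpy as np

def suppress_linear_dependence(S, tol=1e-6):
    """Drop overlap-matrix eigenvectors with eigenvalues below tol.

    Returns the canonical-orthogonalization matrix X (n x m, m <= n)
    and the number of suppressed basis function combinations.
    """
    eigvals, eigvecs = np.linalg.eigh(S)   # S is real symmetric
    keep = eigvals >= tol                  # combinations to retain
    # X = U s^{-1/2} over the retained eigenpairs, so that X^T S X = I
    X = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    return X, int(np.count_nonzero(~keep))

# Two nearly parallel basis functions give one near-zero overlap
# eigenvalue, so one combination is suppressed.
S = np.array([[1.0, 0.999999], [0.999999, 1.0]])
X, n_dropped = suppress_linear_dependence(S, tol=1e-5)
print(n_dropped)                                      # 1
print(np.allclose(X.T @ S @ X, np.eye(X.shape[1])))   # True
```

In the retained subspace the transformed overlap is exactly the identity, so the SCF and MP2 steps proceed in a slightly smaller, numerically well-conditioned basis.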
As can be seen from Table 4, the parallel efficiency even on 96
CPUs is excellent. Comparing the total elapsed time on 96 CPUs
with that on 48 CPUs, for the smaller PC-2 basis the speed-up
factor is virtually 2.0, while for the larger aug-cc-pvtz basis it is
actually greater than 2.0. (This is due to the reduction in elapsed
time for the steps requiring the most I/O, Tbin and particularly
Tsort, with increasing number of CPUs on the Myrinet network.)
In the latter case, the estimated parallel efficiency[17] of the SCF
step was 93.2% on 96 CPUs. Note also that
this calculation (with more than 3600 basis functions) was
completed using only 4 GB RAM per CPU. It required two
passes for some of the integrals, and so the first half-transfor-
mation step took longer than it should have, but it demon-
strates just how little (relatively) computational resource is
needed even for very large jobs.
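The speed-up factor and efficiency figures quoted above follow the usual definitions; the sketch below illustrates them with placeholder timings (the exact estimator of ref. [17] may differ in detail, and the numbers here are not the measured values from Table 4):

```python
def speedup_factor(t_small_p, t_large_p):
    # Elapsed-time ratio when doubling the CPU count; the ideal value
    # is 2.0, and > 2.0 indicates superlinear scaling (here caused by
    # reduced I/O pressure per CPU at higher CPU counts).
    return t_small_p / t_large_p

def parallel_efficiency(t_ref, p_ref, t_p, p):
    # Percent efficiency on p CPUs relative to a reference run on p_ref CPUs.
    return 100.0 * (t_ref * p_ref) / (t_p * p)

# Placeholder elapsed times (hours) on 48 and 96 CPUs:
t48, t96 = 10.0, 4.8
print(round(speedup_factor(t48, t96), 3))            # 2.083
print(round(parallel_efficiency(t48, 48, t96, 96), 2))  # 104.17
```

A factor greater than 2.0, as observed for the aug-cc-pvtz run, corresponds to an efficiency above 100% relative to the 48-CPU reference.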
The observed scaling with increasing basis set size, which is
approximately O(N^4), was also affected by the recalculation of
integrals; in practice it is typically closer to O(N^3.5).
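An empirical scaling exponent of this kind can be extracted from two timings at different basis-set sizes by assuming a power law T ≈ c·N^x. A minimal sketch, using the two basis sizes from the text but a made-up timing ratio chosen purely for illustration:

```python
import math

def scaling_exponent(n1, t1, n2, t2):
    # Fit T ~ c * N**x through two (N, T) points:
    # x = ln(T2/T1) / ln(N2/N1)
    return math.log(t2 / t1) / math.log(n2 / n1)

# Basis sizes 2334 (PC-2) and 3613 (aug-cc-pvtz); placeholder timings
# constructed to reproduce O(N^3.5) behavior exactly.
t1 = 1.0
t2 = t1 * (3613 / 2334) ** 3.5
print(round(scaling_exponent(2334, t1, 3613, t2), 2))  # 3.5
```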
Finally, Table 5 gives the SCF and MP2 energies for all sys-
tems calculated in this paper along with the integral and final
SCF convergence thresholds. The canonical MP2 energies
reported are essentially exact (within the limitations of the in-
tegral threshold) and represent benchmarks that can be used
to check future algorithmic developments and the accuracy of
more approximate methods such as RI-MP2.[18] The geometries
(Cartesian coordinates) used for each system are provided as
Supporting Information.
Conclusions
We have presented details of our full accuracy unrestricted
open-shell MP2 energy algorithm along with timings for sin-
gle-point UMP2 energies for a number of formally closed-shell
systems and long-lived radicals. The algorithm is both serially
efficient and has a high parallel efficiency even on large num-
bers of CPUs. Memory demands are modest, with calculations
involving up to 2000 basis functions typically requiring no
more than 2 GB RAM per CPU. The bottleneck for very large
calculations is likely to be disk storage of which there must be
sufficient to hold all the half-transformed integrals (four times
as many as for a corresponding RMP2 energy); jobs run on
multiple nodes can access the cumulative disk storage on all
nodes and so larger clusters can readily run jobs on systems
with several thousand basis functions. Jobs are likely to run
faster on a larger number of nodes with fewer CPUs (cores) per
node than on a smaller number of similarly configured nodes
with more CPUs per node, due to the decrease in I/O resource
per CPU in the latter case. The unrestricted open-shell algorithm
complements our existing closed-shell RMP2 code, with UMP2
energies taking between 1.5 and 3.0 times as long as an RMP2
energy on a similar system.
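The factor of four in half-transformed integral storage follows from counting occupied-orbital pairs: RMP2 stores one triangular same-spin block, while UMP2 needs alpha-alpha, beta-beta, and the full alpha-beta block. A back-of-envelope sketch (the orbital count is hypothetical):

```python
def ump2_pair_count(na, nb):
    # Occupied-pair blocks for UMP2: two triangular same-spin blocks
    # plus the full rectangular mixed-spin block.
    same_spin = na * (na + 1) // 2 + nb * (nb + 1) // 2
    mixed_spin = na * nb
    return same_spin + mixed_spin

n = 50  # hypothetical number of correlated occupied orbitals per spin
rmp2_pairs = n * (n + 1) // 2        # single triangular block for RMP2
print(round(ump2_pair_count(n, n) / rmp2_pairs, 2))  # 3.96
```

For equal alpha and beta occupations the ratio approaches 4 as the number of occupied orbitals grows, consistent with the storage estimate above.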
Acknowledgments
This work was supported by the National Science Foundation under
grant number CHE-0911541, by the Mildred B. Cooper Chair at the
University of Arkansas and by Parallel Quantum Solutions. Acquisi-
tion of the Star of Arkansas supercomputer was supported in part
by the National Science Foundation under award number MRI-
0722625. The authors thank Prof. Peter Pulay for useful discussions
and Dr. Tomasz Janowski for help with some of the calculations,
particularly those run on the Star of Arkansas.
[1] PQS version 4.0, beta, Parallel Quantum Solutions, 2013 Green
Acres Road, Suite A, Fayetteville, Arkansas 72703. Available at
[email protected], http://www.pqs-chem.com.
[2] J. Baker, K. Wolinski, M. Malagoli, D. Kinghorn, P. Wolinski, G.
Magyarfalvi, S. Saebo, T. Janowski, P. Pulay, J Comput Chem 2009, 30,
317.
[3] P. Pulay, S. Saebo, K. Wolinski, Chem Phys Lett 2001, 344, 543.
[4] J. Baker, P. Pulay, J Comput Chem 2002, 23, 1150.
[5] S. Saebo, J. Almlof, Chem Phys Lett 1989, 154, 83.
[6] M. Yoshimine, J Comp Phys 1973, 11, 333.
[7] Y. Jung, R. C. Lochan, A. D. Dutoi, M. Head-Gordon, J Chem Phys 2004,
121, 9793.
[8] S. Grimme, J Chem Phys 2003, 118, 9095.
[9] M. Gomberg, J Am Chem Soc 1900, 22, 757.
[10] F. Montanari, S. Quici, H. H. Riyad, T. T. Tidwell, Encyclopedia of
Reagents for Organic Synthesis; Wiley, 2005.
[11] J. Tamuliene, A. Tamulis, J. Kulys, Nonlinear Anal: Modell Control 2004,
9, 185.
[12] R. Born, W. Fischer, D. Heger, B. Tokarczyk, J. Wirz, Photochem Photo-
biol Sci 2007, 6, 552.
[13] A. Rajca, K. Shiraishi, M. Pink, S. Rajca, J Am Chem Soc 2007, 129,
7232.
[14] (a) J. J. P. Stewart, J Comput Chem 1989, 10, 209, 221; (b) J. J. P. Stewart, J Comput Chem 1990, 11, 543.
[15] (a) F. Jensen, J Chem Phys 2001, 115, 9113; (b) F. Jensen, J Chem Phys
2002, 116, 3502.
[16] I. Dhimitruka, M. Velayutham, A. A. Bobko, V. V. Khramtsov, F. A. Villa-
mena, C. M. Hadad, J. L. Zweier, Bioorg Med Chem Lett 2007, 17,
6801.
[17] J. Baker, M. Shirel, Parallel Comput 2000, 26, 1011.
[18] O. Vahtras, J. E. Almlof, M. W. Feyereisen, Chem Phys Lett 1993, 213,
514.
Received: 9 June 2011; Revised: 30 July 2011; Accepted: 30 July 2011; Published online on 27 August 2011