An efficient parallel algorithm for the calculation of unrestricted canonical MP2 energies
DOI: 10.1002/jcc.21924
Jon Baker*[a,b] and Krzysztof Wolinski[a,c]
We present details of our efficient implementation of full accuracy
unrestricted open-shell second-order canonical Møller–Plesset
(MP2) energies, both serial and parallel. The algorithm is based on
our previous restricted closed-shell MP2 code using the Saebo–
Almlof direct integral transformation. Depending on system details,
UMP2 energies take from less than 1.5 to about 3.0 times as long as
a closed-shell RMP2 energy on a similar system using the same
algorithm. Several examples are given, including timings for some
large stable radicals with 90+ atoms and over 3600 basis functions. © 2011 Wiley Periodicals, Inc. J Comput Chem 32: 3304–3312, 2011

Keywords: UMP2 energies · Saebo–Almlof direct integral transformation · parallel algorithm

Introduction
Whenever a new computational algorithm is developed for
the calculation of some theoretical quantity (a new implemen-
tation of an existing method, as is the case in this work, or an
initial implementation of a new quantity), it is nearly always
done first for the closed-shell case. Not only does this provide
considerable simplification in the actual coding, but as the
vast majority of chemical systems are closed-shell, the
resulting algorithm has immediate maximum applicability. Not
infrequently, the unrestricted open-shell version of the new
algorithm is either not developed at all, or done much later.
Although most of the algorithmic difficulties have already
been overcome during the closed-shell development, the
open-shell version—often dismissed as a ‘‘trivial’’ modification
to the closed-shell code—may be far from straightforward in
addition to involving a lot more work.
An unrestricted version of a method is important, not only for
genuine open-shell systems but also for situations where the
formally closed-shell wavefunction is UHF unstable, that is, the
energy can be lowered by breaking the spatial symmetry of the
a and b molecular orbitals. Closed-shell versions of virtually any
wavefunction rarely describe bond-breaking adequately, and
the unrestricted version is usually much better. Even situations
where bonds are stretched, for example in transition states,
may be better described at the unrestricted level, depending on
the degree of spin contamination in the resulting wavefunction.
This article presents some details of the unrestricted open-
shell canonical second-order Møller–Plesset (MP2)
energy module as implemented in the PQS program pack-
age.[1,2] The basic design of the algorithm follows that of
the closed-shell version[3,4] and is again based on the Saebo–
Almlof direct integral transformation.[5] However, we have
modified certain aspects of the algorithm, both serial and par-
allel, particularly the resorting of the initial half-transformed
integrals (the bin-sort step).
The unrestricted MP2 energy can be written as

\[ E_{\mathrm{UMP2}} = E_{aa} + E_{ab} + E_{bb} \tag{1} \]

where

\[ E_{aa} = \sum_{I \le J} e_{IJ} = \sum_{I \le J} \sum_{A > B} \frac{\left[(IA|JB) - (IB|JA)\right]^2}{e_I + e_J - e_A - e_B} \quad \text{(all MOs $a$-spin)} \tag{1a} \]

\[ E_{ab} = \sum_{I,J} e_{IJ} = \sum_{I,J} \sum_{A,B} \frac{(IA|JB)^2}{e_I + e_J - e_A - e_B} \quad \text{($I,A$ $a$-spin; $J,B$ $b$-spin)} \tag{1b} \]

\[ E_{bb} = \sum_{I \le J} e_{IJ} = \sum_{I \le J} \sum_{A > B} \frac{\left[(IA|JB) - (IB|JA)\right]^2}{e_I + e_J - e_A - e_B} \quad \text{(all MOs $b$-spin)} \tag{1c} \]
Here indices I and J represent occupied molecular orbitals
(MOs), A and B represent virtuals, (IA|JB) represents a fully
transformed MO-integral, e are the UHF orbital energies, and
eIJ is a so-called pair energy, which is the contribution to
the UMP2 energy from the occupied orbital pair I,J. This
form of the UMP2 energy requires an integral transformation
from the original AO- to the MO-basis, and in the Saebo–
Almlof method, this is accomplished via two half-transforma-
tions, involving first the two occupied MOs and then the
two virtuals. The AO-integral (μν|λσ) can be regarded as the
(ν,σ) element of a generalized (exchange) matrix X^{μλ}, and
the two half-transformations can be formulated as matrix
multiplications
[a] J. Baker, K. Wolinski
Parallel Quantum Solutions, 2013 Green Acres Road, Suite A, Fayetteville,
Arkansas 72703
E-mail: [email protected]
[b] J. Baker
Department of Chemistry, University of Arkansas, Fayetteville, Arkansas
72701
[c] K. Wolinski
Department of Chemistry, Maria Curie-Sklodowska University, Lublin, Poland
Additional Supporting Information may be found in the online version
of this article.
Journal of Computational Chemistry © 2011 Wiley Periodicals, Inc. 3304
ORIGINAL ARTICLES
ðlIjkJÞ ¼ YlkIJ ¼ P
m;rCTmIX
lkmrCrJ ¼ CT
I XlkCJ
ðAIjBJÞ ¼ ZIJAB ¼
Pl;k
CTlAY
IJlkCkB ¼ CT
AYIJCB
(2)
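The two half-transformations of eq. (2) can be sketched in NumPy as follows. This is an illustrative, unscreened toy (full loops over every μ,λ pair, dense storage); the function and array names are ours, not those of the PQS code:

```python
import numpy as np

def half_transform_mp2(ao_eri, C_occ, C_virt):
    """Toy version of the Saebo-Almlof two-step transformation, eq. (2).

    ao_eri[mu, nu, lam, sig] holds the AO integrals (mu nu | lam sig).
    """
    n = ao_eri.shape[0]
    nocc, nvirt = C_occ.shape[1], C_virt.shape[1]
    # First half-transformation: for each (mu, lam) pair treat the AO
    # integrals as an exchange matrix X and form Y = C_occ^T X C_occ.
    Y = np.empty((n, n, nocc, nocc))
    for mu in range(n):
        for lam in range(n):
            X = ao_eri[mu, :, lam, :]           # the (nu, sigma) block
            Y[mu, lam] = C_occ.T @ X @ C_occ    # (mu I | lam J)
    # Second half-transformation: for each (I, J) pair gather all
    # (mu, lam) elements (the bin-sort reordering) and transform
    # with the virtual MO coefficients.
    Z = np.empty((nocc, nocc, nvirt, nvirt))
    for I in range(nocc):
        for J in range(nocc):
            Yij = Y[:, :, I, J]                 # the (mu, lam) block
            Z[I, J] = C_virt.T @ Yij @ C_virt   # (A I | B J)
    return Z
```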
The disadvantage of this approach—that the full permuta-
tional symmetry of the AO-integrals cannot be utilized—is
more than offset by the computational efficiency of the matrix
formulation, especially for larger systems. A further gain is that
the sparsity of the AO-integral matrix is such that entire rows
and columns are often zero; these can be eliminated, and the
AO-matrix compressed to give smaller, dense matrices that can
use highly efficient dense matrix multiplication routines[3] as
opposed to the usual sparse matrix techniques that are much
less efficient.
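The row/column compression can be sketched as follows. This is illustrative only (the production code screens shell blocks rather than testing a full matrix for zeros), and the names are ours:

```python
import numpy as np

def compress_and_transform(X, C_left, C_right):
    """Drop all-zero rows and columns of the AO exchange matrix X, so a
    smaller dense matrix multiplication gives the same result as the
    full product C_left^T X C_right."""
    rows = np.flatnonzero(np.any(X != 0.0, axis=1))
    cols = np.flatnonzero(np.any(X != 0.0, axis=0))
    if rows.size == 0 or cols.size == 0:
        return np.zeros((C_left.shape[1], C_right.shape[1]))
    Xc = X[np.ix_(rows, cols)]                  # compressed dense block
    # Zero rows/columns contribute nothing, so only the matching
    # coefficient rows are needed.
    return C_left[rows].T @ Xc @ C_right[cols]
```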
As shown in eq. (2), the second half-transformation requires a
reordering of the half-transformed integrals from Y^{μλ}_{IJ} (i.e., all I,J
indices for each μ,λ pair) to Y^{IJ}_{μλ} (i.e., all μ,λ indices for each I,J
pair), and in the closed-shell algorithm this was accomplished
by a modified Yoshimine bin sort.[6] This step has been further
modified in the UMP2 algorithm, as will be described later.
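An in-memory toy version of this regrouping is shown below; a real Yoshimine sort streams the bins to disk in fixed-size records, and the data layout here is purely illustrative:

```python
from collections import defaultdict

def bin_sort(half_transformed):
    """Regroup half-transformed integrals: records keyed by an AO pair
    (mu, lam), each holding a value for every occupied pair (I, J),
    become bins keyed by (I, J) holding (mu, lam, value) triples."""
    bins = defaultdict(list)
    for (mu, lam), values in half_transformed.items():
        for (I, J), y in values.items():
            bins[(I, J)].append((mu, lam, y))
    return dict(bins)
```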
In order for the algorithm to work, there must be enough
disk storage to hold all the nonzero half-transformed inte-
grals (lI|kJ). These are stored as 5-byte integers[2,4] which
reduces the total storage required by 37.5% compared with
a full 8-byte floating point real. Note that, in parallel, the
aggregate disk storage over all nodes is available, and so the
parallel algorithm can handle somewhat larger systems than
the serial algorithm. Memory demands are modest,
although, for reasons of efficiency, the AO integrals are
calculated in batches over complete shells, and this requires
a minimum of s²N² double words (1 double word = 8 bytes)
of fast memory (where N is the number of basis functions
and s is the shell size, i.e., s = 1 for s shells, s = 3 for p
shells, s = 5 for (spherical) d shells, etc.); if this is not
available, then only those integrals that can be successfully
handled in the available memory are actually utilized, and
that batch is recalculated as many times as necessary. In
practice, calculations involving up to around 2000 basis
functions can readily be carried out with only modest memory
demands (say 2 GB RAM per CPU), and no recalculation of
AO integrals unless there are high angular momentum basis
functions in the basis set.
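To illustrate the storage arithmetic only: keeping 5 of a double's 8 bytes gives exactly the quoted 37.5% saving. The actual 5-byte scheme of Refs. [2,4] is an integer-based packing, not the naive byte truncation sketched here:

```python
import struct

def pack5(x):
    """Keep the 5 most significant bytes of the big-endian IEEE-754
    double (sign, exponent, and the top 28 mantissa bits)."""
    return struct.pack('>d', x)[:5]

def unpack5(b):
    """Restore an approximate double by padding with zero bytes."""
    return struct.unpack('>d', b + b'\x00\x00\x00')[0]
```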
The UMP2 Algorithm
The major complication in the open-shell algorithm is that the
presence of two sets of MOs, corresponding to a and b spin,
results in three different sets of final transformed integrals
(aa|aa), (aa|bb), and (bb|bb) (as shown in eqs. (1a)–(1c), respec-
tively) compared with just one in the closed-shell case (where
the spatial parts of the a- and b-spin orbitals are identical).
Additionally, the I = a, J = b and I = b, J = a half-transformed
ab integrals are stored separately. The way this is handled is to
do all three first half-transformations together, as the raw
atomic integrals are computed, giving three half-transformed
integral files at the same time. (The disk storage requirements
are thus more-or-less four times those of a similar closed-shell
calculation, with the ab integral file being twice the size of the
other two.) Once the half-transformed integrals have all been
stored on disk, the second half-transformations are carried out
essentially independently on each file separately.
If there is sufficient storage, then all the half-transformed
integrals from a single file can be read, sorted, and written to
disk before any of them are further transformed. If the half-
transformed integral file is too large to effectively duplicate,
then the second half-transformation can be done in multiple
passes, doing as many I,J indices in each pass as can be suc-
cessfully sorted and stored in the available disk storage. It
turned out that the time penalty for doing multiple passes
compared with a single pass in the original RMP2 algorithm
was negligible for small numbers of passes.[4] Furthermore,
when the algorithm was first developed (the early 2000s), it
was awkward to have files larger than 2 GB on the 32-bit archi-
tecture machines available at that time, and so the second half-
transformation was limited in each pass to a subset of the I,J
pairs for which all sorted integrals could be written to a single
2 GB file. (Actually for a direct access file, it was not the overall
size of the file that counted but the size and total number of
the records written to it.) If necessary (which was often the
case), the first half-transformed integrals were written to multi-
ple files. This file size limitation has of course been lifted for the
foreseeable future on 64-bit architecture machines.
In the UMP2 algorithm, we have added an ‘‘in-core’’ option
which limits the number of I,J pairs in each pass of the second
half-transformation step to those that can be sorted and trans-
formed in the available memory. In the serial algorithm, this
eliminates the bin-sort file altogether and hence requires no
additional disk storage, at the cost of potentially multiple
reads of the half-transformed integral file (which is, however,
read sequentially). This is faster than the standard ‘‘bin-sort’’
algorithm if all the I,J pairs can be processed in a relatively
small number of reads of the integral file, but becomes more
and more inefficient as the system size increases. However, it
does allow calculations on systems that are so large relative to
the available resources that there is little or no disk storage
left after computing and saving all the first half-transformed
integrals. Note that once a specific contribution to the total
MP2 energy has been computed, say Eaa [eq. (1a)], the corre-
sponding integral file is deleted, thus freeing up disk storage
for the bin sort. Consequently, the less efficient ‘‘in-core’’ algo-
rithm should normally be confined to the calculation of one
energy term only, if it is used at all.
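The multiple-pass bookkeeping described above amounts to a simple ceiling division; the helper below is hypothetical (pass planning in the real code is driven by disk as well as memory limits):

```python
def plan_passes(n_pairs, bytes_per_pair, mem_bytes):
    """How many passes over the half-transformed integral file are
    needed if only mem_bytes of memory is available to hold the sorted
    Y^{IJ} matrices for a batch of I,J pairs."""
    pairs_per_pass = max(1, mem_bytes // bytes_per_pair)
    n_passes = -(-n_pairs // pairs_per_pass)  # ceiling division
    return pairs_per_pass, n_passes
```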
Perhaps the biggest modification in the UMP2 algorithm has
been to the bin-sort step. In the original RMP2 algorithm, each
sorted 5-byte integral was stored together with its two AO-
indices (as 2-byte integers), and so each integral in effect
required 9 bytes of storage. This was done primarily because
the magnitude of each integral was checked as it was read
from the half-transformed integral file, and integrals whose
contribution would be below the threshold were discarded.
The two AO-indices are of course common for all I,J pairs in
that record on the half-transformed integral file and so, if no
integrals are discarded, only need to be written once. An array
for the AO indices is now kept and written separately from the
5-byte integrals. The bulk of the very small integrals are elimi-
nated at the AO-level by a highly efficient integral neglect
scheme based on localized MOs,[3] and the number that can
be further neglected from the half-transformed integrals is, rel-
atively speaking, not significant. The additional work involved
in handling those few integrals that could have been
neglected is more than offset by the subsequent simplification
in the code and the considerable reduction in disk storage for
the sorted integrals. Note that the 5-byte integral storage
procedure used originally[4] was subsequently modified so as
to remove all possibility of integer overflow,[2] and it is the
modified procedure that is used here.
The size of each record on the bin-sort file (i.e., the number
of I,J bins written at any one time) is carefully selected to cor-
respond to the number of I,J indices that can be read back
and transformed in the available memory, that is, to the num-
ber of YIJ matrices that can be handled in core at once. In the
parallel algorithm, the I,J pairs are further divided among
the slaves; after the half-transformed integrals on each slave's
local disk have been sorted, they are sent to the slave that has
been assigned to deal with that particular batch of I,J pairs,
which subsequently writes them to a local bin-sort file prior
to transforming them and forming the corresponding pair
energies.
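A toy version of this pair distribution is a simple block assignment of the lower-triangular (I, J) list; the actual PQS scheme is not specified here, so this is an assumption for illustration:

```python
def assign_pairs(n_occ, n_slaves):
    """Block assignment of occupied-orbital pairs (I, J), J <= I, to
    slaves: contiguous chunks of (nearly) equal size."""
    pairs = [(I, J) for I in range(n_occ) for J in range(I + 1)]
    block = -(-len(pairs) // n_slaves)  # ceiling division
    return {s: pairs[s * block:(s + 1) * block] for s in range(n_slaves)}
```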
Results and Discussion
Table 1 presents serial timings (in minutes) for computation of
the MP2 energy (starting from converged MOs from the
preceding SCF step) for a number of formally closed-shell
molecular systems, ranging in size from a few hundred to over
two thousand basis functions. Their structures are shown sche-
matically in Figure 1. In all cases, core orbitals (defined as
those MOs with orbital energy below −3.0 au) were omitted
from the calculation. Both RMP2 and UMP2 calculations were
performed on each system. This enables a direct comparison
of the performance of the closed-shell and open-shell
algorithms, as well as a check on the accuracy of the UMP2
algorithm, that is, is the answer right?
There are, in terms of computing the MP2 energy, two limit-
ing cases: (1) small molecules with (relatively) large basis sets
and (2) large molecules with (relatively) small basis sets. In the
first case (the most common), calculating the MP2 energy is
dominated by computing the AO-integrals and the first half-
transformation; in the second, it is dominated by the bin sort
and the second half-transformation. In the UMP2 algorithm,
because there are three different integral types (and effectively
four times as many half-transformed integrals), then the bin
sort and second half-transformation will take around three to
four times as long as in the RMP2 algorithm (which has only a
single integral type). The first half-transformation step, on the
Table 1. RMP2, UMP2, and SOS-MP2 single-point energy timings (in minutes).[a]

                                          First half-trans.    Second half-trans.
       NBas  NCorr  NVirt  RAM[b]        Tint    Ttran        Tbin   Tsort   Ttran    Total

(H2O)20 MP2/6-31G*
RMP2    320    80    280    320           2.0     1.8          0.4     -      1.8       6.0
UMP2    320    80    280    320           1.8     4.1          1.5    0.6     5.5      13.5
SOS                                       1.8     2.9          0.6    0.2     3.0       8.7

C16H28N6O6 MP2/6-311G**
RMP2    672    79    565    400          21.2    11.2          1.9     -     10.7      45
UMP2    672    79    565    400          19.9    21.7          8.3    3.2    35.1      88
SOS                                      19.9    17.1          3.8    1.3    17.8      60

C20H32N10O11 MP2/6-31G*
RMP2    679   114    524    400           9.4     7.1          7.1     -     18.8      42
UMP2    679   114    524    400           9.4    19.8         16.7   12.6    64.0     123
SOS                                      10.1    13.2          8.5    5.8    31.7      70

C21H26N2O3 (yohimbine) MP2/PC-2
RMP2   1144    69   1049   1200         255     162            6.2     -     50.0     467
UMP2   1144    69   1049   1200         251     260           20.3   10.4   152       696
SOS                                     241     229            9.5    4.3    76.4     561

C32H32O4 (calix[4]arene) MP2/cc-pvtz[c]
RMP2   1528    92   1400   1200         117     140            4.4     -    167       428
UMP2   1528    92   1400   1200         119     249           16.0   20.7   628      1034
SOS                                     120     227            7.3    8.1   318       681

C44H42N4 MP2/PC-2[c]
RMP2   2028   119   1861   3000         440     509           17       -    649      1615
UMP2   2028   119   1861   3000         423     900           84     325   2500      4241
SOS                                     426     793           44     151   1297      2716

Timings reported are as follows (all timings elapsed): first half-trans.: Tint: time to compute raw AO-integrals; Ttran: time to carry out the integral transformation; second half-trans.: Tbin: time to read in the half-transformed integrals, reorder indices and rewrite to the bin-sort file; Tsort: time to read in the reordered integrals in 5-byte format and convert to full double precision; Ttran: time to carry out the integral transformation. A dash indicates a timing not reported separately for the RMP2 runs.
[a] Serial timings on a 2.26 GHz Intel Xeon E5520 quad-core X2 processor. [b] Memory request in MB. [c] C2h symmetry.
other hand, should certainly not take four times longer than it
does in the RMP2 algorithm, because calculating the AO-inte-
grals is a significant component, and this takes the same time
in both cases.
As system size increases even further, the second half-transformation
step will become dominant in essentially all cases,
despite the formally higher scaling of the first quarter-transformation
step (O(nN⁴) compared with O(n²N³) at worst for the
other steps, where n is the number of orbitals correlated and
N is the number of basis functions), because increasing AO-integral
neglect ultimately reduces the formal scaling of the first
step down to O(nN²).
Computing a UMP2 energy should therefore take, at worst,
no more than four times as long as the corresponding RMP2
energy for molecules of type (2), above, and in practice is
more typically a factor of around two, and potentially even
less the closer you are to the type (1) limit. This is borne out
by the timings shown in Table 1. Also shown in Table 1 are
timings for the scaled opposite-spin (SOS)-MP2 energy. This is
a procedure from Head-Gordon and coworkers[7] similar to the
earlier MP2 scaling approach of Grimme[8] who showed that
MP2 energies can be systematically improved by scaling
the opposite-spin and same-spin components of the MP2
energy separately. (The scaling factors determined by Grimme
were 6/5 and 1/3, respectively.)

Figure 1. Structures of the molecules used for the job timings reported in Table 1.

Arguments are advanced in
Ref. [7] for eliminating the smaller same-spin component entirely
and compensating by having a larger scaling factor on the
opposite-spin component. (The suggested scaling factor is 1.3.)
The statistical improvement over the standard MP2 energy
for SOS-MP2[7] is similar to that found by Grimme using his
two-component scaling.[8] However, SOS-MP2 has the advant-
age that the same-spin components do not need to be
computed at all. For a UMP2 energy, this means that only the
second term in eq. (1), the ab cross term given by eq. (1b),
survives leading to potential savings in both the first and, in
particular, the second half-transformation.
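Using the scaling factors quoted above, the two scaled energies reduce to one-line formulas (the component energies would come from eqs. (1a)-(1c); the helper names are ours):

```python
def scs_mp2(e_ab, e_ss):
    """Grimme's spin-component-scaled MP2: scale the opposite-spin
    (ab) part by 6/5 and the same-spin (aa + bb) part by 1/3."""
    return (6.0 / 5.0) * e_ab + (1.0 / 3.0) * e_ss

def sos_mp2(e_ab):
    """SOS-MP2 of Head-Gordon and coworkers: drop the same-spin part
    entirely and scale the opposite-spin part by 1.3."""
    return 1.3 * e_ab
```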
The ratios between the times needed to compute the RMP2
and UMP2 energies for the two largest systems in Table 1 are
greater than they ought to be based on the discussion given
above. (They should be a factor of around 2.0 or less rather
than, e.g., the almost 2.5 for calix[4]arene.) The reason is that
both systems have symmetry (C2h which has four symmetry
operations) which in the current code can be utilized in the
first half-transformation (by evaluating only symmetry-unique
AO integrals) but not in the second. Consequently, in compari-
son with a system with no usable symmetry, the first half-
transformation step only takes about a quarter of the time it
ought to (one divided by the number of symmetry opera-
tions), giving the second half-transformation step much more
weight. For very large systems, the ratio will increase, as the
second half-transformation becomes dominant.
Parallel UMP2 energy timings are given in Table 2 for a
number of long-lived organic radicals, the structures of which
are shown in Figure 2. Triphenylmethyl was the first radical
ever discovered (by Gomberg[9]); TEMPO is widely used as a
radical trap, as a structural probe for biological systems in
conjunction with electron spin resonance (ESR) spectroscopy,
as a reagent in organic synthesis, and as a mediator in con-
trolled free-radical polymerization[10]; DPPH is also widely used
as a radical scavenger; the dodecyl syringate radical has been
proposed as a candidate for ESR molecular quantum com-
puters,[11] and the 6,11-diphenyloxy-5,12-naphthacenequinone
triplet diradical has been postulated as an intermediate in the
photoinduced phenyl ring transfer.[12] The largest system is an
aminyl diradical derived from m-phenylene synthesized by the
Rajca group.[13] The five collinearly fused rings are reported to
be approximately planar with the 4-tert-butylphenyl group
nearly perpendicular to them. As this system (94 atoms, 2020
basis functions) taxes the resources of the cluster we are
using, particularly the disk storage, Cs symmetry was enforced
to reduce the number of half-transformed integrals. All calcula-
tions used the semiempirical PM3[14] optimized geometries. To
give at least some idea of the scaling of the algorithm with
system size, we used the same basis set in all cases, Jensen’s
polarization-consistent PC-2 basis.[15] This is a TZ-type basis set
with 3s2p1d on hydrogen and 4s3p2d1f on first-row elements
and is fairly typical of the size of basis used in serious MP2
calculations.
Table 2. Parallel UMP2/PC-2 single-point energy timings (in minutes)[a] for some long-lived organic radicals, run utilizing one CPU per node.

2,2,6,6-Tetramethylpiperidine-1-oxyl "TEMPO" (C9H18NO), Cs; NAtom 29, NBas 582, NCorr 33(a)/32(b), NVirt 538(a)/539(b), RAM[b] 720
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    4     15.8      0.93    0.18    1.16      18.1
    8      7.9      1.43    0.12    0.59      10.1
   12      5.3      1.77    0.13    0.39       7.6

Triphenylmethyl (CPh3), C3; NAtom 34, NBas 780, NCorr 46(a)/45(b), NVirt 715(a)/716(b), RAM[b] 480
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    4     53.0      2.29    0.54    5.30      61.3
    8     26.6      1.15    0.29    2.67      30.8
   12     17.7      0.79    0.20    1.79      20.5

Diphenylpicrylhydrazyl "DPPH" (C18H12N5O6), C1; NAtom 41, NBas 1038, NCorr 73(a)/72(b), NVirt 936(a)/937(b), RAM[b] 720
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    4    135.7     16.5     6.8    30.8      190.1
    8     67.7      9.4     2.9    15.4       95.7
   12     45.1      6.9     1.5    10.3       63.9

Dodecyl syringate (C21H33O5), C1; NAtom 59, NBas 1242, NCorr 74(a)/73(b), NVirt 1142(a)/1143(b), RAM[b] 1500
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    4    115.1     27.8    14.5    55.2      212.3
    8     57.8     13.1     6.9    27.7      105.8
   12     38.4     10.7     3.0    18.5       70.8

6,11-Diphenyloxy-5,12-naphthacenequinone triplet diradical (C30H18O4), C1; NAtom 52, NBas 1272, NCorr 82(a)/80(b), NVirt 1156(a)/1158(b), RAM[b] 1800
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    4    274.7     31.9    16.2    70.7      394.5
    8    137.5     16.3     8.1    35.4      197.7
   12     91.2     11.7     5.7    23.6      132.7

Triplet aminyl diradical (C42H50N2), Cs; NAtom 94, NBas 2020, NCorr 115(a)/113(b), NVirt 1861(a)/1863(b), RAM[b] 3500
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    4    579.2    109.7    88.1   568.6     1349
    8    289.9     57.2    47.9   284.6      681.8
   12    192.7     40.1    33.2   189.7      459.4

Timings reported are as follows (all timings elapsed): Ttran(1): time to compute raw AO-integrals and carry out the first half-transformation (not reported separately in the parallel algorithm); Tbin: time to read in the half-transformed integrals, reorder indices and rewrite to the bin-sort file; Tsort: time to read in the reordered integrals in 5-byte format and convert to full double precision; Ttran(2): time to carry out the second half-transformation.
[a] Parallel timings on a 3.0 GHz Intel Pentium D930 dual-core processor. [b] Memory request per CPU in MB.
The jobs were run on an at least 5-year-old PC-based cluster
with a single 3.0 GHz Intel Pentium D930 dual-core processor,
4 GB of RAM and 300 GB of striped RAID0 scratch storage per
node. Communication between the nodes was via Gigabit
Ethernet. Despite its age, the clock speed of each processor
on this cluster is comparable with the best modern PCs,
because for the past several years, machines have been get-
ting ‘‘faster’’ overall not by increasing the clock speed of each
individual CPU but by incorporating more and more CPUs
(cores) per processor. At the time of writing, multicore
processors including up to 12 CPUs are readily available. The
configuration of this older cluster is actually advantageous for
MP2 jobs, as modern systems with the same number of CPUs
(we utilized twelve nodes for a total of 24 CPUs) generally have
far fewer nodes, and consequently less resource per CPU, than
older clusters, which have fewer CPUs per node. For example,
Figure 2. Structures of the radicals used for the job timings reported in Tables 2 and 3.
Efficient Algorithm for the Calculation of Unrestricted Canonical MP2 Energies
Journal of Computational Chemistry http://wileyonlinelibrary.com/jcc 3309
running a job over all 12 nodes allows the aggregate disk
storage on each node to be utilized; a more modern system
with all CPUs on a single node would require 12 times the
disk storage per node to maintain the same amount of disk
storage per CPU. The same would apply of course to memory
(RAM).
The situation is even worse for I/O as modern disk drives,
despite their increased capacity, typically still have only a sin-
gle I/O controller per drive. Thus, with the same disk configu-
ration, the amount of I/O resource per CPU decreases in direct
proportion to the increase in the number of CPUs per node.
As MP2 jobs involve a lot of I/O, they are likely to benefit sig-
nificantly when run over many nodes compared with running
over just a few nodes with much less I/O capability per CPU.
The single-point UMP2 energies reported in Table 2 were
run utilizing one of the two available CPUs per node. Table 3
reports the same systems run on the same number of nodes
but using both available CPUs. This doubles the number of
CPUs from a maximum of 12 to a maximum of 24, but halves
the I/O resource per CPU. The effect of the reduced I/O
capacity can be seen by comparing the 8-CPU timings
between the two tables (one ran on eight nodes using one
CPU per node, the other ran on four nodes using two CPUs
per node). Apart from the smallest system (TEMPO, which has
only 582 basis functions), the total job time increases
significantly, by up to 40%, between the one- and two-CPU-per-node
runs. Every step that involves I/O takes longer when both
CPUs are used. Both the first half-transformation time (Ttran,
which involves I/O when the integrals are written to disk), and
particularly the second half-transformation (Tbin and Tsort),
are affected. That the effect is due to I/O only can be seen
from the timing for the integral transformation in the second
half-transformation step (Ttran), which is the timing for the
final transformation only and involves no I/O; this is essentially
the same in both cases.

Table 3. UMP2/PC-2 single-point energy timings (in minutes)[a] for some long-lived organic radicals, run utilizing two CPUs per node.

2,2,6,6-Tetramethylpiperidine-1-oxyl "TEMPO" (C9H18NO), Cs; NAtom 29, NBas 582, NCorr 33(a)/32(b), NVirt 538(a)/539(b), RAM[b] 720
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    8     8.30      0.31    0.12    0.59       9.4
   16     4.15      0.42    0.09    0.30       5.0
   24     2.78      0.52    0.09    0.20       3.6

Triphenylmethyl (CPh3), C3; NAtom 34, NBas 780, NCorr 46(a)/45(b), NVirt 715(a)/716(b), RAM[b] 480
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    8    29.5       3.15    0.48    2.69      36.0
   16    14.7       0.94    0.24    1.35      17.4
   24     9.8       0.75    0.17    0.91      11.8

Diphenylpicrylhydrazyl "DPPH" (C18H12N5O6), C1; NAtom 41, NBas 1038, NCorr 73(a)/72(b), NVirt 936(a)/937(b), RAM[b] 720
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    8    71.6      24.9    13.7    15.5      126.0
   16    35.6      14.2     5.6     7.8       63.5
   24    23.8       9.6     2.3     5.2       41.1

Dodecyl syringate (C21H33O5), C1; NAtom 59, NBas 1242, NCorr 74(a)/73(b), NVirt 1142(a)/1143(b), RAM[b] 1500
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    8    60.9      38.0    21.7    27.9      148.9
   16    30.3      20.5    11.8    14.0       76.9
   24    20.3      14.6     7.9     9.3       52.5

6,11-Diphenyloxy-5,12-naphthacenequinone triplet diradical (C30H18O4), C1; NAtom 52, NBas 1272, NCorr 82(a)/80(b), NVirt 1156(a)/1158(b), RAM[b] 1800
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
    8   150.1      45.2    26.2    35.6      258.2
   16    75.2      22.8    12.8    17.8      129.1
   24    49.9      16.4     9.1    11.9       87.7

See comments regarding timing details under Table 2.
[a] Parallel timings on a 3.0 GHz Intel Pentium D930 dual-core processor. [b] Memory request per CPU in MB.

Table 4. UMP2 single-point energy timings (in minutes)[a] for the Finland trityl radical, run on the Star of Arkansas utilizing two CPUs per node.

Finland trityl radical (C40H39O6S12), PC-2 basis set, C3; NAtom 97, NBas 2334, NCorr 154(a)/153(b), NVirt 2074(a)/2075(b), RAM[b] 4000
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
   48   106.5      88.2    81.3    71.8      349.1
   72    71.0      65.9    42.6    47.9      228.6
   96    53.3      54.1    30.9    36.0      174.9

Finland trityl radical (C40H39O6S12), aug-cc-pvtz basis set[c], C3; NAtom 97, NBas 3613, NCorr 154(a)/153(b), NVirt 3186(a)/3187(b), RAM[b] 4000
  NCPU  Ttran(1)    Tbin   Tsort  Ttran(2)   Total
   48  1638       179.6   240.5   260.3     2331
   72  1093       141.4   151.4   173.6     1569
   96   820        87.9    99.8   130.4     1141

See comments regarding timing details under Table 2.
[a] Parallel timings on a 2.65 GHz Xeon E5430 quad-core processor. [b] Memory request per CPU in MB. [c] 167 basis function pairs suppressed due to linearly dependent basis set.
The job time for TEMPO actually decreases with apparently
decreasing I/O resource. Alone among the systems studied,
TEMPO shows an increasing elapsed time for the bin sort
(Tbin) with increasing number of CPUs. This strongly suggests
that communication overhead is the bottleneck here with
insufficient data sent to compensate for this. The decrease in
time when both CPUs per node are used is likely due to the
decrease in internode communication.
Note that the largest job (the aminyl diradical with 2020 ba-
sis functions) could not be run using both CPUs per node, as
it required more than 2 GB memory per CPU to run efficiently.
We actually gave this job 3.5 GB per CPU (see Table 2), far
more than the minimum needed.
As can be seen from the timings reported in both Tables 2
and 3, the parallel efficiency of the UMP2 algorithm, at least up
to 24 CPUs, is very high. Considering Table 2, then with 100%
parallel efficiency the total job time on 8 CPUs should be half,
and that on 12 CPUs should be a third, of the job time reported
on 4 CPUs. This is close to being the case for all jobs except
the smallest (TEMPO). The same thing applies when comparing
the corresponding 16- and 24-CPU timings with the 8-CPU
timings reported in Table 3. Given that the total job times show
a high parallel efficiency, not surprisingly so do the individual
timings for the various job steps. The least parallel-efficient step
is the bin sort (the step that involves the most I/O).
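The parallel efficiencies discussed here follow from a simple ratio; the helper below is our own, not part of PQS:

```python
def parallel_efficiency(t_ref, n_ref, t_n, n_cpu):
    """Parallel efficiency (%) of an n_cpu run relative to a reference
    run on n_ref CPUs: 100% means the job time scales as 1/NCPU."""
    return 100.0 * (t_ref * n_ref) / (t_n * n_cpu)
```

For DPPH in Table 2, for example, this gives an efficiency of roughly 99% on 12 CPUs relative to the 4-CPU run.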
Table 4 reports parallel UMP2 energy timings for the Finland
trityl radical[16] (Fig. 3) run on 48, 72, and 96 CPUs of the Star
of Arkansas, a University-wide resource with 157 dual-proces-
sor nodes, each containing two 2.65 GHz Xeon E5430 quad-
core processors, 16 GB RAM and at least 250 GB scratch stor-
age. The internal network utilizes Myrinet which is faster than
Gigabit Ethernet and has a much lower latency. We used two
different basis sets, the PC-2 basis used for many other calcula-
tions reported in this paper (2334 basis functions) and the
larger aug-cc-pvtz basis (3613 basis functions). In addition to
the parallel scaling, these calculations show the scaling with
increasing basis set size for the same system size.
Using the aug-cc-pvtz basis results in severe linear dependency (the lowest eigenvalue of the overlap matrix was 9.01 × 10^-10), and we suppressed (i.e., eliminated) all basis function combinations with eigenvalues lower than 10^-5. This led to the removal of 167 basis function combinations. In all the other calculations the limit on the lowest eigenvalue of the overlap matrix for basis function suppression was 10^-6 (low enough that all basis functions were kept).

Figure 3. Structure of the Finland trityl radical (C40H39O6S12). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Table 5. SCF and MP2 energies for all the systems reported in this paper.

System                      Basis set    Integral thresh[a]   SCF thresh[b]   E(SCF)          E(MP2)[c]
(H2O)20                     6-31G*       10^-10/10^-9         2 × 10^-6       -1520.1377885   -1523.9339277
C16H28O6N6                  6-311G**     10^-10/10^-9         2 × 10^-6       -1398.5548925   -1403.0710703
C20H32N10O11                6-31G*       10^-10/10^-9         9 × 10^-6       -2144.0670136   -2150.1610763
Yohimbine (C21H26N2O3)      PC-2         10^-10               7 × 10^-7       -1144.2026976   -1448.6982211
Calix[4]arene (C32H32O4)    cc-pvtz      10^-10               6 × 10^-6       -1530.2719323   -1536.3998656
C44H42N4                    PC-2         10^-11               3 × 10^-8       -1908.8828226   -1916.8014598
TEMPO (C9H18NO)             PC-2         10^-10               3 × 10^-7       -480.6978113    -482.6678844
CPh3                        PC-2         10^-10               1 × 10^-7       -728.4669260    -731.2670428
DPPH (C18H12N5O6)           PC-2         10^-10               3 × 10^-6       -1410.0376839   -1415.1649922
Dodecyl syringate           PC-2         10^-10               5 × 10^-7       -1189.1492757   -1193.6913084
C30H18O4[d]                 PC-2         10^-11               1 × 10^-7       -1446.4337022   -1451.6914684
C42H50N2                    PC-2         10^-11               5 × 10^-7       -1729.0447935   -1736.0714234
C40H39O6S12                 PC-2         10^-12/10^-11        3 × 10^-7       -6757.7433943   -6767.4467778
C40H39O6S12[e]              aug-cc-pvtz  10^-14/10^-11        2 × 10^-7       -6757.7741542   -6768.1276175

[a] If two integral thresholds are given, the first is for the SCF, the second for MP2; if only one threshold is given it is the same for both SCF and MP2. [b] The RMS observed at convergence in the commutator of the Fock and the density matrix (known as the Brillouin condition). [c] In all cases, core orbitals (orbital energy < -3.0 au) were omitted from the calculation. [d] MP2 energy potentially less accurate for this system due to a near-linearly dependent basis set in combination with the 5-byte integral packing scheme (the lowest eigenvalue of the overlap matrix was 3.87 × 10^-6). [e] MP2 energy potentially less accurate for this system due to a linearly dependent basis set in combination with the 5-byte integral packing scheme and basis function suppression (the lowest eigenvalue of the overlap matrix was 9.01 × 10^-10; 167 basis function combinations were suppressed).
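The eigenvalue-based suppression described above is essentially canonical orthogonalization: diagonalize the overlap matrix S and discard eigenvectors whose eigenvalues fall below the threshold. The following is an illustrative sketch only (the function name and the tiny two-function example are our own, not the PQS implementation):

```python
import numpy as np

def suppress_linear_dependence(S, tol=1e-6):
    """Drop overlap-matrix eigenvectors with eigenvalues below tol.

    Returns the canonical-orthogonalization matrix X (n x m, m <= n)
    and the number of suppressed basis function combinations.
    """
    eigvals, eigvecs = np.linalg.eigh(S)   # S is real symmetric
    keep = eigvals >= tol                  # combinations to retain
    # X = U s^{-1/2} over the retained eigenpairs, so that X^T S X = I
    X = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    return X, int(np.count_nonzero(~keep))

# Two nearly parallel basis functions give one near-zero overlap
# eigenvalue, so one combination is suppressed.
S = np.array([[1.0, 0.999999], [0.999999, 1.0]])
X, n_dropped = suppress_linear_dependence(S, tol=1e-5)
print(n_dropped)                                      # 1
print(np.allclose(X.T @ S @ X, np.eye(X.shape[1])))   # True
```

In the retained subspace the transformed overlap is exactly the identity, so the SCF and MP2 steps proceed in a slightly smaller, numerically well-conditioned basis.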
As can be seen from Table 4, the parallel efficiency even on 96
CPUs is excellent. Comparing the total elapsed time on 96 CPUs
with that on 48 CPUs, for the smaller PC-2 basis the speed-up
factor is virtually 2.0, while for the larger aug-cc-pvtz basis it is
actually greater than 2.0. (This is due to the reduction in elapsed
time for the steps requiring the most I/O, Tbin and particularly
Tsort, with increasing number of CPUs on the Myrinet network.)
In the latter case, the estimated parallel efficiency[17] of the SCF
step was 93.2% on 96 CPUs. Note also that
this calculation (with more than 3600 basis functions) was
completed using only 4 GB RAM per CPU. It required two
passes for some of the integrals, and so the first half-transfor-
mation step took longer than it should have, but it demon-
strates just how little (relatively) computational resource is
needed even for very large jobs.
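The speed-up factor and efficiency figures quoted above follow the usual definitions; the sketch below illustrates them with placeholder timings (the exact estimator of ref. [17] may differ in detail, and the numbers here are not the measured values from Table 4):

```python
def speedup_factor(t_small_p, t_large_p):
    # Elapsed-time ratio when doubling the CPU count; the ideal value
    # is 2.0, and > 2.0 indicates superlinear scaling (here caused by
    # reduced I/O pressure per CPU at higher CPU counts).
    return t_small_p / t_large_p

def parallel_efficiency(t_ref, p_ref, t_p, p):
    # Percent efficiency on p CPUs relative to a reference run on p_ref CPUs.
    return 100.0 * (t_ref * p_ref) / (t_p * p)

# Placeholder elapsed times (hours) on 48 and 96 CPUs:
t48, t96 = 10.0, 4.8
print(round(speedup_factor(t48, t96), 3))            # 2.083
print(round(parallel_efficiency(t48, 48, t96, 96), 2))  # 104.17
```

A factor greater than 2.0, as observed for the aug-cc-pvtz run, corresponds to an efficiency above 100% relative to the 48-CPU reference.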
The observed scaling with increasing basis set size, which is
approximately O(N^4), was also affected by the recalculation of
integrals; in practice it is typically closer to O(N^3.5).
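An empirical scaling exponent of this kind can be extracted from two timings at different basis-set sizes by assuming a power law T ≈ c·N^x. A minimal sketch, using the two basis sizes from the text but a made-up timing ratio chosen purely for illustration:

```python
import math

def scaling_exponent(n1, t1, n2, t2):
    # Fit T ~ c * N**x through two (N, T) points:
    # x = ln(T2/T1) / ln(N2/N1)
    return math.log(t2 / t1) / math.log(n2 / n1)

# Basis sizes 2334 (PC-2) and 3613 (aug-cc-pvtz); placeholder timings
# constructed to reproduce O(N^3.5) behavior exactly.
t1 = 1.0
t2 = t1 * (3613 / 2334) ** 3.5
print(round(scaling_exponent(2334, t1, 3613, t2), 2))  # 3.5
```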
Finally, Table 5 gives the SCF and MP2 energies for all sys-
tems calculated in this paper along with the integral and final
SCF convergence thresholds. The canonical MP2 energies
reported are essentially exact (within the limitations of the in-
tegral threshold) and represent benchmarks that can be used
to check future algorithmic developments and the accuracy of
more approximate methods such as RI-MP2.[18] The geometries
(Cartesian coordinates) used for each system are provided as
Supporting Information.
Conclusions
We have presented details of our full accuracy unrestricted
open-shell MP2 energy algorithm along with timings for sin-
gle-point UMP2 energies for a number of formally closed-shell
systems and long-lived radicals. The algorithm is both serially
efficient and has a high parallel efficiency even on large num-
bers of CPUs. Memory demands are modest, with calculations
involving up to 2000 basis functions typically requiring no
more than 2 GB RAM per CPU. The bottleneck for very large
calculations is likely to be disk storage of which there must be
sufficient to hold all the half-transformed integrals (four times
as many as for a corresponding RMP2 energy); jobs run on
multiple nodes can access the cumulative disk storage on all
nodes and so larger clusters can readily run jobs on systems
with several thousand basis functions. Jobs are likely to run
faster on a larger number of nodes with fewer CPUs (cores) per
node than on a smaller number of similarly configured nodes
with more CPUs per node, due to the decrease in I/O resource
per CPU in the latter case. The unrestricted open-shell algorithm
complements our existing closed-shell RMP2 code, with UMP2
energies taking between 1.5 and 3.0 times as long as an RMP2
energy on a similar system.
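The factor of four in half-transformed integral storage follows from counting occupied-orbital pairs: RMP2 stores one triangular same-spin block, while UMP2 needs alpha-alpha, beta-beta, and the full alpha-beta block. A back-of-envelope sketch (the orbital count is hypothetical):

```python
def ump2_pair_count(na, nb):
    # Occupied-pair blocks for UMP2: two triangular same-spin blocks
    # plus the full rectangular mixed-spin block.
    same_spin = na * (na + 1) // 2 + nb * (nb + 1) // 2
    mixed_spin = na * nb
    return same_spin + mixed_spin

n = 50  # hypothetical number of correlated occupied orbitals per spin
rmp2_pairs = n * (n + 1) // 2        # single triangular block for RMP2
print(round(ump2_pair_count(n, n) / rmp2_pairs, 2))  # 3.96
```

For equal alpha and beta occupations the ratio approaches 4 as the number of occupied orbitals grows, consistent with the storage estimate above.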
Acknowledgments
This work was supported by the National Science Foundation under
grant number CHE-0911541, by the Mildred B. Cooper Chair at the
University of Arkansas and by Parallel Quantum Solutions. Acquisi-
tion of the Star of Arkansas supercomputer was supported in part
by the National Science Foundation under award number MRI-
0722625. The authors thank Prof. Peter Pulay for useful discussions
and Dr. Tomasz Janowski for help with some of the calculations,
particularly those run on the Star of Arkansas.
[1] PQS version 4.0, beta, Parallel Quantum Solutions, 2013 Green
Acres Road, Suite A, Fayetteville, Arkansas 72703. Available at
[email protected], http://www.pqs-chem.com.
[2] J. Baker, K. Wolinski, M. Malagoli, D. Kinghorn, P. Wolinski, G.
Magyarfalvi, S. Saebo, T. Janowski, P. Pulay, J Comput Chem 2009, 30,
317.
[3] P. Pulay, S. Saebo, K. Wolinski, Chem Phys Lett 2001, 344, 543.
[4] J. Baker, P. Pulay, J Comput Chem 2002, 23, 1150.
[5] S. Saebo, J. Almlof, Chem Phys Lett 1989, 154, 83.
[6] M. Yoshimine, J Comp Phys 1973, 11, 333.
[7] Y. Jung, R. C. Lochan, A. D. Dutoi, M. Head-Gordon, J Chem Phys 2004,
121, 9793.
[8] S. Grimme, J Chem Phys 2003, 118, 9095.
[9] M. Gomberg, J Am Chem Soc 1900, 22, 757.
[10] F. Montanari, S. Quici, H. H. Riyad, T. T. Tidwell, Encyclopedia of
Reagents for Organic Synthesis; Wiley, 2005.
[11] J. Tamuliene, A. Tamulis, J. Kulys, Nonlinear Anal: Modell Control 2004,
9, 185.
[12] R. Born, W. Fischer, D. Heger, B. Tokarczyk, J. Wirz, Photochem Photo-
biol Sci 2007, 6, 552.
[13] A. Rajca, K. Shiraishi, M. Pink, S. Rajca, J Am Chem Soc 2007, 129,
7232.
[14] (a) J. J. P. Stewart, J Comput Chem 1989, 10, 209, 221; (b) J. J. P. Stewart, J Comput Chem 1990, 11, 543.
[15] (a) F. Jensen, J Chem Phys 2001, 115, 9113; (b) F. Jensen, J Chem Phys
2002, 116, 3502.
[16] I. Dhimitruka, M. Velayutham, A. A. Bobko, V. V. Khramtsov, F. A. Villa-
mena, C. M. Hadad, J. L. Zweier, Bioorg Med Chem Lett 2007, 17,
6801.
[17] J. Baker, M. Shirel, Parallel Comput 2000, 26, 1011.
[18] O. Vahtras, J. E. Almlof, M. W. Feyereisen, Chem Phys Lett 1993, 213,
514.
Received: 9 June 2011; Revised: 30 July 2011; Accepted: 30 July 2011; Published online on 27 August 2011