QYMSYM: A GPU-accelerated hybrid symplectic integrator that permits close encounters



New Astronomy 16 (2011) 445–455



Alexander Moore*, Alice C. Quillen
Department of Physics and Astronomy, University of Rochester, Rochester, NY 14627, USA

Article history: Received 20 July 2010; Received in revised form 31 January 2011; Accepted 30 March 2011; Available online 3 April 2011. Communicated by J. Makino.

Keywords: Celestial mechanics; Symplectic integrators; Acceleration of particles; CUDA

doi:10.1016/j.newast.2011.03.009

* Corresponding author. Tel.: +1 315 404 5877; fax: +1 585 273 3237. E-mail addresses: [email protected] (A. Moore), [email protected] (A.C. Quillen). URL: http://www.astro.pas.rochester.edu/~aquillen/.

Abstract. We describe a parallel hybrid symplectic integrator for planetary system integration that runs on a graphics processing unit (GPU). The integrator identifies close approaches between particles and switches from symplectic to Hermite algorithms for particles that require higher resolution integrations. The integrator is approximately as accurate as other hybrid symplectic integrators but is GPU accelerated.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

The field of solar system dynamics has a timeline of discoveries that is related to the computational power available (e.g., Morbidelli, 2001). As computational power has increased over time, so too has our ability to more accurately simulate incrementally more complex systems via the inclusion of more sophisticated physics or simply a greater number of particles. The interaction between planets and planetesimals subsequent to formation is a well posed N-body problem but displays remarkable complexity even in the absence of collisions, including resonance capture, planetary migration, and heating and scattering of planetesimals. In this paper we attempt to more accurately simulate this system in parallel by using the increased computational power and device memory recently made more accessible on video graphics cards.

Popular numerical integrators in the field of celestial mechanics, most notably SyMBA and MERCURY, have become mainstays due to their accuracy and speed. These software packages excel at integrations which only use a few massive objects, as is often the case in celestial mechanics problems. Drastic improvements in simulation speed would require entirely new integration methods or high levels of code optimization. One way to solve this problem is via the computational performance benefits of parallelization. Improvements via parallelization in celestial mechanics integrations can be achieved in two ways: either by increasing the performance of a single simulation or by allowing groups of simulations to be run simultaneously.

Running large groups of solar system integrations concurrently is possible on a range of hardware including clusters of computers as well as GPUs,1 and even on a single CPU with multiple cores via simple batch scripting. Alternatively, it is possible to increase the efficiency of a single simulation via parallelization under certain conditions which are related to the algorithms required and the simulation parameters (e.g. number of particles, run-time, etc.). Again, depending on the algorithms being implemented, parallelizing code can yield anywhere from little benefit to tremendous advantages. In the case of solar system integrations, a variety of techniques can be used to re-express algorithms which are non-obvious candidates for parallelization, such as an order N (O(N)) Kepler's equation solver, into better performers. Higher order computations, particularly O(N²) and higher algorithms, often experience larger performance gains when parallelized, as is the case for a typical all-pairs force computation. Additionally, higher order computations allow N to be smaller yet still receive benefits from parallelization.

Parallelizing single large N simulations is the method that the authors pursue here. However, we attempt to achieve better performance on a single low cost device rather than write code for use on larger, costlier and more exclusive distributed computing clusters or supercomputers. We take advantage of the fact that an inherently large number of transistors on a GPU are dedicated to ensuring high performance of floating point operations. This hardware advantage is by design: floating point computations are the basis of graphics performance on a modern desktop. This focus on floating point performance is not shared by consumer x86 architecture CPUs, which have a large amount of their die space and a large percentage of transistors dedicated to the data caching and flow control necessary for adequate performance of the many serial non-floating-point operations that execute on a given operating system.

1 See Eric Ford's web page on Swarm-NG, http://www.astro.ufl.edu/eford/code/swarm/docs/README.html.

With these design goals and hardware advantages and disadvantages in mind, we have developed our integrator to work in a problem space that is somewhat different from that of SyMBA or MERCURY. It is not intended as a higher performance small N integration suite, but rather one that allows large N integrations to be run for many orbital periods in reasonable wall clock times on a single desktop computer. Particular problems, including Kuiper Belt or asteroid belt dynamics (with mass), ring formation, and planetesimal and debris disk dynamics, are all targets of this software.

Our code2 is written for Compute Unified Device Architecture (CUDA) enabled devices. CUDA, implemented on NVIDIA graphics devices, is a GPGPU (General-Purpose computing on Graphics Processing Units) architecture that allows a programmer to use a C-like programming language with extensions to code algorithms for execution on the graphics processing unit. It provides a development environment and Application Programming Interfaces (APIs) for CUDA enabled GPUs specifically tailored for parallel compute purposes. It achieves this by exposing the hardware to the developer through a memory management model and thread hierarchy that encourages both constant streaming of data as well as parallelization. We will discuss porting the code to other parallel computing environments below.
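To make the thread hierarchy concrete, the following is a minimal sketch of a CUDA kernel and launch in the style described above. It is purely illustrative: the kernel simply advances positions by velocities, and the array names and launch configuration are assumptions rather than the QYMSYM source.

```
// Minimal CUDA sketch of the thread hierarchy described above.
// Array names (pos, vel) and the launch configuration are illustrative
// and are not taken from the QYMSYM source.
#include <cuda_runtime.h>

__global__ void advanceKernel(double3 *pos, const double3 *vel,
                              double dt, int n)
{
    // One thread per particle: blockIdx/threadIdx form the thread hierarchy.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}

void advance(double3 *d_pos, const double3 *d_vel, double dt, int n)
{
    int threads = 128;                       // threads per block
    int blocks  = (n + threads - 1) / threads;
    advanceKernel<<<blocks, threads>>>(d_pos, d_vel, dt, n);
}
```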

Some of the algorithms used in our code, including all-pairs force computations as well as parallel sums, scans and sorts, are generically available from various sources, including the NVIDIA CUDA developers SDK and the CUDPP library, both of which will be referenced in more detail below. Also, the Thrust library3 offers many of the same algorithms in template form. In addition to our own previous work on a single precision symplectic integrator in the celestial mechanics setting, Capuzzo-Dolcetta et al. (2011) have developed 2nd and 6th order symplectic GPU integrators for use in the galactic setting.

2. A second order democratic heliocentric method symplectic integrator for the GPU

Symplectic integrators are useful for planetary system integrations because they preserve an energy (or Hamiltonian) that is close to the real value, setting a bound on the energy error during long integrations (Wisdom and Holman, 1991; Wisdom et al., 1996). See Yoshida (1990, 1993) and Leimkuhler and Reich (2004) for reviews of symplectic integrators. We have modified the second order symplectic integrators introduced by Duncan et al. (1998) and Chambers (1999), and created an integrator that runs in parallel on a GPU. We have chosen the democratic heliocentric method (Duncan et al., 1998) because the force from the central body is separated from the integration of all the remaining particles and the coordinates do not depend on the order of the particles.

2 QYMSYM is available for download at http://astro.pas.rochester.edu/~aquillen/qymsym.

3 http://code.google.com/p/thrust/.

Following a canonical transformation, in heliocentric coordinates and barycentric momenta (Wisdom et al., 1996) the Hamiltonian of the system can be written

$$H(P,Q) = H_{\rm Dft}(P) + H_{\rm Kep}(P,Q) + H_{\rm Int}(Q) \qquad (1)$$

where

$$H_{\rm Dft} = \frac{1}{2 m_0}\left|\sum_{i=1}^{N} \mathbf{P}_i\right|^2 \qquad (2)$$

is a linear drift term and P_i is the barycentric momentum of particle i. Here m_0 is the central particle mass. The second term H_Kep is the sum of Keplerian Hamiltonians for all particles with respect to the central body,

$$H_{\rm Kep} = \sum_{i=1}^{N}\left(\frac{P_i^2}{2 m_i} - \frac{G m_i m_0}{|\mathbf{Q}_i|}\right) \qquad (3)$$

where Q_i are the heliocentric coordinates and are conjugate to the barycentric momenta. Here m_i is the mass of the ith particle and G is the gravitational constant. The interaction term contains all gravitational interaction terms except those to the central body,

$$H_{\rm Int} = \sum_{i=1}^{N}\sum_{j=1,\,j\neq i}^{N}\left(-\frac{G m_i m_j}{2\,|\mathbf{Q}_i - \mathbf{Q}_j|}\right) \qquad (4)$$

A second order integrator advances with timestep τ using the evolution operators (e.g., Yoshida, 1990)

$$E_{\rm Dft}\!\left(\frac{\tau}{2}\right)\,E_{\rm Int}\!\left(\frac{\tau}{2}\right)\,E_{\rm Kep}(\tau)\,E_{\rm Int}\!\left(\frac{\tau}{2}\right)\,E_{\rm Dft}\!\left(\frac{\tau}{2}\right) \qquad (5)$$

The Keplerian advance requires order N computations but the interaction term requires order N² computations. However, encounter detection is most naturally done during the Keplerian step and requires O(N²) computations if all particle pairs are searched for close encounters. If there is no search for close encounters then the Keplerian and interaction steps can be switched (Moore et al., 2008), reducing the number of computations.

Each of the evolution operators above can be evaluated in parallel. The drift evolution operator requires computation of the sum of the momenta. We have implemented this using a parallel reduction sum parallel primitive algorithm available with the NVIDIA CUDA Software Development Kit (SDK) 1.1 that is similar to the parallel prefix sum (scan) algorithm (Harris et al., 2008).
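For illustration, a minimal shared-memory block reduction of the kind used for this momentum sum is sketched below. It is a generic version of the technique (each block produces a partial sum that must be combined in a second pass or on the host), not the SDK kernel used in the code, and the array names are assumptions.

```
// Generic shared-memory block reduction, similar in spirit to the SDK
// reduction used for the momentum sum in the drift step.
// Launch:  reduceSum<<<blocks, threads, threads * sizeof(double)>>>(...)
__global__ void reduceSum(const double *in, double *blockSum, int n)
{
    extern __shared__ double s[];                 // one slot per thread
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0;
    __syncthreads();

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockSum[blockIdx.x] = s[0];    // per-block partial sum
}
```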

The Keplerian step is implemented by computing f and g functions using the universal differential Kepler's equation (Prussing and Conway, 1993) so that bound and unbound particles can both be integrated with the same routine (see Appendix A). The Keplerian evolution step is also done on the GPU with each thread computing the evolution for a separate particle. Kepler's equation is usually solved iteratively until a precision limit is achieved. However, the Laguerre algorithm (Conway, 1986) (also see Chapter 2 of Prussing and Conway (1993)) converges more rapidly than a Newton method and moreover converges regardless of the starting approximation. We have found that the routine converges to the double precision limit (of order 10^-16) in fewer than 6 iterations independent of initial condition. See Appendix A for the procedure.
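The following sketch shows the Laguerre iteration applied, for simplicity, to the classical elliptic Kepler equation M = E − e sin E rather than the universal-variable form actually used in the code; the fixed degree parameter n = 5 follows Conway (1986), and the function name is an assumption.

```
// Sketch of the Laguerre iteration (Conway 1986) applied to the classical
// elliptic Kepler equation  M = E - e sin E.  The universal-variable form
// used in the code follows the same pattern with different f, f', f''.
#include <math.h>

__host__ __device__ double solveKeplerLaguerre(double M, double e)
{
    const double n = 5.0;          // Laguerre "degree" parameter (Conway)
    double E = M;                  // starting guess; convergence is
                                   // insensitive to this choice
    for (int it = 0; it < 10; ++it) {
        double f    = E - e * sin(E) - M;    // residual of Kepler's equation
        double fp   = 1.0 - e * cos(E);      // dF/dE
        double fpp  = e * sin(E);            // d2F/dE2
        double disc = sqrt(fabs((n - 1.0) * (n - 1.0) * fp * fp
                                - n * (n - 1.0) * f * fpp));
        double denom = (fp >= 0.0) ? fp + disc : fp - disc;  // larger |denom|
        double dE = n * f / denom;
        E -= dE;
        if (fabs(dE) < 1e-15) break;         // near the double precision limit
    }
    return E;
}
```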

The interaction terms are computed on the GPU with all N² force pairs evaluated explicitly in parallel. The algorithm is based on the algorithm described by Nyland et al. (2008). This algorithm takes advantage of fast shared memory on board the GPU to simultaneously compute all forces in a p × p tile of particle positions, where p is the number of threads chosen for the computation (typically 128 or 256). The total energy is evaluated with a kernel explicitly evaluating all N² pair potential energy terms, similar to that calculating all N² forces.
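A minimal version of this tiled all-pairs scheme is sketched below, assuming positions stored as double4 with G times the mass in the w component and an optional softening eps2; it illustrates the Nyland et al. (2008) shared-memory tiling rather than reproducing the QYMSYM kernel.

```
// Tiled all-pairs acceleration sketch in the style of Nyland et al. (2008).
// Each block stages blockDim.x positions at a time into shared memory and
// every thread accumulates the acceleration on one particle.  The double4
// layout (G*m_j in .w) and the softening eps2 are illustrative assumptions.
__global__ void allPairsAccel(const double4 *pos, double3 *acc,
                              int n, double eps2)
{
    extern __shared__ double4 tile[];             // staged positions
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    double4 pi = (i < n) ? pos[i] : make_double4(0, 0, 0, 0);
    double3 ai = make_double3(0, 0, 0);

    for (int start = 0; start < n; start += blockDim.x) {
        int j = start + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : make_double4(0, 0, 0, 0);
        __syncthreads();
        for (int k = 0; k < blockDim.x && start + k < n; ++k) {
            double4 pj = tile[k];
            double dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
            double r2 = dx * dx + dy * dy + dz * dz + eps2;
            double inv_r3 = (r2 > 0.0) ? rsqrt(r2) / r2 : 0.0;  // 1/r^3
            ai.x += pj.w * dx * inv_r3;           // pj.w holds G * m_j
            ai.y += pj.w * dy * inv_r3;
            ai.z += pj.w * dz * inv_r3;
        }
        __syncthreads();
    }
    if (i < n) acc[i] = ai;
}
```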

After the change to heliocentric/barycentric coordinates, the position of the first coordinate corresponds to the center of mass and center of momentum. The trajectory of this particle does not need to be integrated. However, it is convenient to calculate the energy using all pair interactions including the central mass. The interaction term in the Keplerian part of the Hamiltonian can be computed at the same time as H_Int if Q_0 is set to zero. Consequently we set Q_0 = P_0 = 0 at the beginning of the computation. This is equivalent to working in the center of mass and momentum reference frame. Because we would like to be able to quickly check the total energy, we have chosen to keep the first particle, corresponding to the center of mass and momentum, as the first element in the position and velocity arrays. During computation of H_Int we set m_0 to zero so that force terms from the first particle are not computed. These are already taken into account in the evolution term corresponding to H_Kep. The mass is restored during the energy sum computation as all potential energy terms must be calculated explicitly.

The particle positions and velocities are kept on the GPU during the computation and transferred back into host or CPU accessible memory to output data files. An additional vector of length equal to the number of particles is allocated in global memory on the device to compute the momentum sums used in the drift step computation. We limit the number of host to device and device to host memory transfers by persistently maintaining position and velocity information for all particles on the device because frequent data transfer between the CPU and GPU will reduce overall performance of the code. The maximum theoretical throughput of the PCI-Express 2.0 16x bus, the interlink between the CPU and GPU on Intel processor based motherboards, is 8 GB/s with a significant latency penalty. This sets an upper limit on the total number of particles of ~10^7 based on several gigabytes of GPU memory,4 but computation time on a single GPU for this number of particles would make such simulations unrealistic.
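The persistent-storage strategy can be sketched as follows, with hypothetical structure and function names: particle state is allocated once in device global memory and copied back to the host only when an output file is written.

```
// Sketch of the persistent-device-storage strategy described above.
// Struct and function names are illustrative assumptions.
#include <cuda_runtime.h>
#include <vector>

struct DeviceState {
    double3 *pos = nullptr;     // heliocentric coordinates, resident on GPU
    double3 *vel = nullptr;     // barycentric velocities, resident on GPU
    double  *scratch = nullptr; // length-N work vector for momentum sums
    int n = 0;
};

void allocateState(DeviceState &s, int n)
{
    s.n = n;
    cudaMalloc(&s.pos, n * sizeof(double3));
    cudaMalloc(&s.vel, n * sizeof(double3));
    cudaMalloc(&s.scratch, n * sizeof(double));
}

// Called once per output interval, not every timestep.
void copyForOutput(const DeviceState &s,
                   std::vector<double3> &pos, std::vector<double3> &vel)
{
    pos.resize(s.n); vel.resize(s.n);
    cudaMemcpy(pos.data(), s.pos, s.n * sizeof(double3), cudaMemcpyDeviceToHost);
    cudaMemcpy(vel.data(), s.vel, s.n * sizeof(double3), cudaMemcpyDeviceToHost);
}
```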

While the transfer data rate limitations between the host and device are important, use of global device memory should also be monitored carefully. Depending on memory clock speed, width of the memory interface and memory type, theoretical maximum global memory transfer throughputs are currently ~150 GB/s. This is the case for the GT200(b) architecture Quadro FX 5800 and GeForce GTX 280/285 GPUs in our cluster, which range from 140 to 160 GB/s respectively. Despite enjoying a significantly greater data transfer rate compared to that of the PCI-Express interlink, the latency penalty of device memory is also quite large, from 400 to 800 cycles. Therefore, explicit global memory access should be limited whenever possible. Shared memory, which is basically a user controlled cache that can be accessed by all cores on an individual multiprocessor, can be used to reduce this penalty. As described above, the interaction and energy kernels leverage shared memory and benefit greatly. By streaming information from global memory to shared memory, we are able to hide the latency of global memory, further increasing the computation speed. Even when all global memory transactions of a warp issued to a multiprocessor can be coalesced (executed simultaneously), the latency is at least the aforementioned several hundred cycles. Uncoalesced global memory access can be even more costly. However, it is not always possible to write an algorithm to access shared memory in a sensible manner, and limitations on the amount of shared memory space may force direct global memory access regardless. This is the case for our GPU implementation of the solver for the universal differential Kepler's equation.

4 The linked lists used in the collision detection routines to be discussed enforce a lower realistic limit.

Though the maximum number of threads per block on the video cards we used was 512 for GT200 architecture cards and 1024 for GF100 architecture cards, we found that restrictions on the number of available registers on each multiprocessor limited all of the major kernels to 128 or 256 threads per block. A more detailed review of NVIDIA GPU hardware and programming techniques can be found in the CUDA Programming Guide.

Our first parallel symplectic integrator necessarily ran in single precision (Moore et al., 2008) as graphics cards were not until recently capable of carrying out computations in double precision. Our current implementation uses double precision for all vectors allocated on both GPU and CPU. A corrector has been implemented allowing accelerations to be computed in single precision but using double precision accuracy for the particle separations (Gaburov et al., 2009). If we wrote a similar corrector we could run our interaction step in single precision (but with nearly double precision accuracy), achieving a potential speed up of a factor of roughly 8 on CUDA 1.3 compatible devices. The newest CUDA 2.0 compatible devices have superior double precision capabilities, limiting this potential speed up to a factor of 2. Additionally, it would be more difficult to create a corrector for the Keplerian evolution step; consequently the current version of the integrator is exclusively in double precision.

We work with a lengthscale in units of the outermost planet's initial semi-major axis and with a timescale such that GM⋆ = 1 where M⋆ is the mass of the central star. In these units the innermost planet's orbital period is 2π, however we often describe time in units of the innermost planet's initial orbital period.

2.1. Close encounters

Symplectic integrators cannot reduce the timestep during close encounters without shifting the Hamiltonian integrated and destroying the symplectic properties of the integrator (e.g., Yoshida, 1990). During a close encounter one of the interaction terms in H_Int becomes large compared to the Keplerian term, H_Kep. Consequently, the symplectic integrator described above becomes inaccurate when two massive objects undergo a close approach. To preserve the symplectic nature of the integrator Duncan et al. (1998) used an operator splitting approach and decomposed the potential into a set of functions with increasingly small cutoff radii. Chambers (1999) instead used a transition function which, because of its relative simplicity (and reduced number of computations), we have adopted here. A transition function can be used to move the strong interaction terms from the interaction Hamiltonian to the Keplerian one so that the entire Hamiltonian is preserved (Chambers, 1999). The Keplerian Hamiltonian becomes

$$H'_{\rm Kep} = \sum_{i=1}^{N}\left(\frac{P_i^2}{2 m_i} - \frac{G m_i m_0}{|\mathbf{Q}_i|}\right) - \sum_{i,j=1,\,j\neq i}^{N} \frac{G m_i m_j}{2 q_{ij}}\bigl(1 - K(q_{ij})\bigr) \qquad (6)$$

and the interaction Hamiltonian becomes

$$H'_{\rm Int} = \sum_{i=1}^{N}\sum_{j=1,\,j\neq i}^{N}\left(-\frac{G m_i m_j}{2 q_{ij}} K(q_{ij})\right) \qquad (7)$$

where q_ij = |Q_i − Q_j| and K(q_ij) is a transition or change-over function that is zero when the distance between two objects is small and 1 when it is large, thus H'_Kep + H'_Int = H_Kep + H_Int. The transition function we use is

$$K(y) = \begin{cases} 0 & \text{if } y \le 0 \\ \sin\!\left(\dfrac{y\pi}{2}\right) & \text{if } 0 < y < 1 \\ 1 & \text{if } y \ge 1 \end{cases}$$


with

$$y = \frac{1.1\, q_{ij}}{r_{\rm crit}} - 0.1 \qquad (8)$$

The parameter y becomes zero at q_ij ≈ 0.1 r_crit where r_crit is a critical radius.

The above transition function is similar to that chosen by Chambers (1999) but we use a sine function instead of a polynomial function. Certain transcendental and most trigonometric functions, such as sine, have special high performance versions available in CUDA. These high performance versions, particularly in single precision, are somewhat less accurate5 but can be computed very quickly. This particular optimization reduces the number of floating point computations from the 8 required for the polynomial function used by Chambers (1999) to just a few.
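A device function implementing the transition function might look like the sketch below, assuming the sine form given above; the single precision variant uses the reduced-accuracy __sinf intrinsic mentioned in the text, while the double precision branch falls back to the standard sin.

```
// Sketch of the transition (change-over) function K, assuming the sine
// form given above.  The single precision variant uses CUDA's fast but
// less accurate __sinf intrinsic; these names are illustrative.
__device__ double transitionK(double q, double rcrit)
{
    double y = 1.1 * q / rcrit - 0.1;          // Eq. (8)
    if (y <= 0.0) return 0.0;
    if (y >= 1.0) return 1.0;
    return sin(y * 1.5707963267948966);        // sin(y*pi/2), double precision
}

__device__ float transitionKf(float q, float rcrit)
{
    float y = 1.1f * q / rcrit - 0.1f;
    if (y <= 0.0f) return 0.0f;
    if (y >= 1.0f) return 1.0f;
    return __sinf(y * 1.5707963f);             // fast, reduced-accuracy intrinsic
}
```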

The choice of transition function is arbitrary, and there are likely to be functions that are somewhat more accurate. However, investigation into the choice of transition function did not reveal significant deviations in energy conservation for a few of our test choices. Additionally, in choosing the sine function over a polynomial involving interparticle distance, our Hamiltonian remains conservative, allowing for true energy checking at this stage of the computation. This has the obvious benefit of allowing code accuracy checks but also has the pragmatic effect of allowing us to more easily identify bugs in the code.

2.2. Hermite integrator

The numerical integrator used for particles undergoing close approaches must also be reasonably fast as we would like our integrator to handle as many particles as possible. Consequently we have chosen a 4th order adaptive step size Hermite integrator for particles undergoing close approaches instead of the Bulirsch–Stoer integrator used by Chambers (1999). This integrator forms the heart of many N-body integrators (e.g., Makino et al., 2003). Our code follows the algorithm described by Makino and Aarseth (1992) but does not use the Ahmad–Cohen scheme. While the Hermite integrator runs on the CPU, we have implemented the routine that computes the accelerations and jerks on both the CPU and GPU. The GPU version is called if the number of particles integrated exceeds a certain value, npswitch.

The Hermite integrator must be modified to use the transition function for computation of both accelerations and jerks (time derivatives of the accelerations). Using predicted velocities we evaluate the acceleration a_i of particle i according to the following

$$\mathbf{a}_i = \sum_{j>0} \frac{G m_j\, \mathbf{q}_{ij}}{s_{ij}^{3}}\Bigl[1 - K(s_{ij}) + s_{ij} K'(s_{ij})\Bigr] + \frac{G m_0\, \mathbf{q}_{i0}}{s_{i0}^{3}} \qquad (9)$$

where $\mathbf{q}_{ij} = \mathbf{q}_i - \mathbf{q}_j$. Here $s_{ij} = \sqrt{q_{ij}^2 + \epsilon^2}$ and $\epsilon$ is an optional smoothing length (see Eq. (2) of Makino and Aarseth (1992) for comparison).

The jerks are evaluated with

$$\dot{\mathbf{a}}_i = \sum_{j} \frac{G m_j}{s_{ij}^{3}}\Biggl\{(\mathbf{v}_{ij}\cdot\mathbf{q}_{ij})\,\mathbf{q}_{ij}\,K''(s_{ij}) + \Biggl[\mathbf{v}_{ij} - \frac{3(\mathbf{v}_{ij}\cdot\mathbf{q}_{ij})\,\mathbf{q}_{ij}}{s_{ij}^{2}}\Biggr]\Bigl[1 - K(s_{ij}) + s_{ij} K'(s_{ij})\Bigr]\Biggr\} + \frac{G m_0}{s_{i0}^{3}}\Biggl[\mathbf{v}_{i0} - \frac{3(\mathbf{v}_{i0}\cdot\mathbf{q}_{i0})\,\mathbf{q}_{i0}}{s_{i0}^{2}}\Biggr] \qquad (10)$$

where $\mathbf{v}_{ij} = \dot{\mathbf{q}}_i - \dot{\mathbf{q}}_j$ is the difference between the predicted velocity vectors of particles i and j.

5 An accurate accounting of all ULP errors for each function can be found in the CUDA Programming Guide.

The transition radius r_crit was described by Chambers (1999) in terms of the mutual Hill radius, or

$$R_{\rm MH} \equiv \left(\frac{m_i + m_j}{3 M_0}\right)^{1/3}\left(\frac{a_i + a_j}{2}\right) \qquad (11)$$

though Chambers (1999) also included an expression that depends on the speed. One problem with this choice is that the forces between particles don't solely depend on the distance between the particles, as they also depend on the distance to the central star. This implies that the forces are not conservative and makes it more difficult to check the energy conservation of the Hermite integrator. Instead we choose a critical radius, r_crit, prior to the Hermite integration and keep it fixed during the integration. The critical radius r_crit is defined to be a factor α_H times the maximum Hill radius of the particles involved in an encounter at the beginning of the Hermite integration:

$$r_{\rm crit} = \alpha_H \times \max\bigl(r_{H,i}\ \text{for $i$ in encounter list}\bigr) \qquad (12)$$

where r_H,i is the Hill radius of particle i.

When using the Hermite integrator we do not allow the central star to move as we keep the system in heliocentric coordinates. We have checked that the total energy given by H'_Kep (Eq. (6)) is well conserved by the Hermite integrator. We find that the accuracy is as good as that using the Hermite integrator lacking the transition function and is set by the two parameters controlling the timestep choice, η and η_s (see Section 2.1 and Eqs. (7) and (9) of Makino and Aarseth (1992)).

We note that our modified interaction step requires N² computations and adding in the transition function would substantially add to the number of computations involved in computing accelerations. We instead make use of the list of particles identified during our encounter identification routine to correct the forces on the particles that are involved in encounters. Consequently all interactions are computed and then only those involved in encounters are corrected just before the Hermite integrator is called.

2.3. Integration procedure for the hybrid integrator

Our procedure for each time step is as follows:

1. Do a drift step (evolving using H_Dft; Eq. (2)) for all particles for timestep τ/2 on the GPU.

2. Do an interaction step (evolving using H_Int; Eq. (4)) using all particles and for all interactions and without using the transition function K for timestep τ/2 on the GPU.

3. Do a Keplerian step (evolving using f and g functions) for all particles for timestep τ on the GPU. Note that the particles undergoing close approaches have been inaccurately integrated but will be corrected later.

4. Use stored positions and velocities (q0, v0, q1, v1) prior to and after the Keplerian step in global memory on the GPU to identify close encounters on the GPU. If there are encounters, transfer pair lists onto the CPU and divide the list of particles involved in encounters into non-intersecting sets. Calculate the maximum Hill radius for particles in each encounter list and use this radius to compute a critical radius r_crit for each encounter set. Only if there are encounters are the positions and velocities prior to and after the Keplerian step copied onto the CPU.

5. For each encounter set, subtract interactions that should not have been calculated in step #2, using the stored positions and velocities on the CPU. Previously we calculated the interaction step using all interactions. However we should have weighted them with the transition function K. Now that we have a critical radius, r_crit, estimated for each encounter list we can correct the interaction step. After interactions weighted by 1 − K have been subtracted, the interaction step effectively calculated is H'_Int (Eq. (7)) rather than H_Int (Eq. (4)).

6. Use a modified Hermite integrator to integrate close approach sublists for τ using the stored CPU positions and velocities q0, v0.

7. For each encounter set, subtract interactions that will be incorrectly calculated by a repeat of the interaction step in step #9.

8. If there have been encounters, copy only the particles involved in encounters back into the arrays q1 and v1 on the CPU. Copy the entire arrays onto the GPU. Particles involved in encounters have been integrated using the Hermite integrator. Particles not involved in encounters have had Keplerian evolution only.

9. Do an interaction step (using H_Int) using all particles and for all interactions and without using the transition function K for timestep τ/2.

10. Do a drift step (using H_Dft) for all particles for timestep τ/2.

As we have checked for encounters during every time step we can flag encounters involving more than one planet. In this case we can choose to do the entire integration step for all particles with the Hermite integrator (but with accelerations and jerks computed on the GPU). As planet/planet encounters are rare this does not need to be done often. Hybrid symplectic integrators such as MERCURY (Chambers, 1999) have their largest deviations in energy during infrequent planet/planet encounters. By integrating the entire system with a conventional N-body integrator during planet/planet encounters we can improve the accuracy of the integrator without compromising the long term stability of the symplectic integrator.
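The ten steps above can be summarized as a host-side driver loop. The sketch below is schematic only: the helper and kernel-launch functions are hypothetical stand-ins for the corresponding QYMSYM routines and are left undefined.

```
// Schematic driver for one hybrid timestep, following the ten steps above.
// Everything here is illustrative; the helpers are undefined stand-ins.
#include <vector>

struct DeviceState;                         // particle arrays resident on the GPU
struct EncounterSets {
    std::vector<std::vector<int>> sets;     // non-intersecting encounter lists
    bool empty() const { return sets.empty(); }
};

void driftLaunch(DeviceState &, double);          // steps 1, 10 (GPU)
void interactionLaunch(DeviceState &, double);    // steps 2, 9 (GPU)
void keplerLaunch(DeviceState &, double);         // step 3 (GPU)
EncounterSets findEncounters(DeviceState &, double);          // step 4
void copyEncountersToHost(DeviceState &, EncounterSets &);
void subtractUnwantedInteractions(EncounterSets &, double);   // steps 5, 7
void hermiteIntegrate(EncounterSets &, double);               // step 6
void copyEncountersToDevice(DeviceState &, EncounterSets &);  // step 8

void hybridStep(DeviceState &s, double tau)
{
    driftLaunch(s, 0.5 * tau);                     // 1. drift, tau/2
    interactionLaunch(s, 0.5 * tau);               // 2. interaction, tau/2
    keplerLaunch(s, tau);                          // 3. Keplerian step

    EncounterSets enc = findEncounters(s, tau);    // 4. two-sweep detection
    if (!enc.empty()) {
        copyEncountersToHost(s, enc);
        subtractUnwantedInteractions(enc, 0.5 * tau);  // 5. correct step 2
        hermiteIntegrate(enc, tau);                    // 6. close approaches
        subtractUnwantedInteractions(enc, 0.5 * tau);  // 7. pre-correct step 9
        copyEncountersToDevice(s, enc);                // 8. update GPU arrays
    }

    interactionLaunch(s, 0.5 * tau);               // 9. interaction, tau/2
    driftLaunch(s, 0.5 * tau);                     // 10. drift, tau/2
}
```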

2.4. Identifying close encounters

Close encounter identification is in general an order N² computation as all particle pairs must be checked every timestep. This is potentially even more computationally intensive than computing all interactions. We do this with two sweeps, each one considering fewer particle pairs. The first sweep is crude, covers all possible particle pairs and so is order N². This one should be as fast as possible to minimize its computational intensity. Shared memory is used for particle positions in a tile computation similar to that used to compute all force interactions. The computation can be done with single precision floating point computations and we can be conservative rather than accurate with encounter identification. For each particle we compute its escape velocity (and this is order N as it is only done once for each particle). A particle pair is counted if the distance between the particles is smaller than the sum of a factor times the mutual Hill radius and the distance moved by the first particle moving at its escape velocity during the timestep.

The first kernel call sweeps through all particle pairs but only counts the number of possible interactors. An array of counts (one per particle) is then scanned in parallel using the parallel prefix (scan) function (Harris et al., 2008) available in the CUDPP subroutine library. CUDPP, the CUDA Data Parallel Primitives Library, is a library of data-parallel algorithm primitives such as parallel prefix-sum, parallel sort and parallel reduction.6 The second kernel call then uses the scanned array to address the locations where the pair identification numbers found in the crude sweep are recorded. As the number of pairs is known after the first kernel call, memory requirements can be considered before the second kernel call is made.
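The count / scan / write pattern described here is sketched below. Thrust's exclusive_scan stands in for the CUDPP parallel prefix sum, and a simple radius cut replaces the Hill-radius plus escape-velocity criterion of the real first sweep; all names are illustrative.

```
// Count / scan / write sketch of the two-kernel pair identification.
// Thrust's exclusive_scan stands in for the CUDPP scan; a plain radius cut
// replaces the real criterion, and all names are illustrative.
#include <thrust/device_vector.h>
#include <thrust/scan.h>

__global__ void countCandidates(const double3 *pos, int *count,
                                int n, double rcut2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int c = 0;
    for (int j = i + 1; j < n; ++j) {
        double dx = pos[j].x - pos[i].x, dy = pos[j].y - pos[i].y,
               dz = pos[j].z - pos[i].z;
        if (dx * dx + dy * dy + dz * dz < rcut2) ++c;   // crude distance test
    }
    count[i] = c;
}

__global__ void writeCandidates(const double3 *pos, const int *offset,
                                int2 *pairs, int n, double rcut2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int w = offset[i];                                  // write offset from scan
    for (int j = i + 1; j < n; ++j) {
        double dx = pos[j].x - pos[i].x, dy = pos[j].y - pos[i].y,
               dz = pos[j].z - pos[i].z;
        if (dx * dx + dy * dy + dz * dz < rcut2) pairs[w++] = make_int2(i, j);
    }
}

void buildPairList(const double3 *d_pos, int n, double rcut)
{
    thrust::device_vector<int> count(n), offset(n);
    countCandidates<<<(n + 127) / 128, 128>>>(
        d_pos, thrust::raw_pointer_cast(count.data()), n, rcut * rcut);

    // Exclusive prefix sum turns per-particle counts into write offsets.
    thrust::exclusive_scan(count.begin(), count.end(), offset.begin());
    int total = offset.back() + count.back();   // size known before 2nd kernel

    thrust::device_vector<int2> pairs(total > 0 ? total : 1);
    writeCandidates<<<(n + 127) / 128, 128>>>(
        d_pos, thrust::raw_pointer_cast(offset.data()),
        thrust::raw_pointer_cast(pairs.data()), n, rcut * rcut);
}
```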

6 http://gpgpu.org/developer/cudpp.

A second more rigorous sweep is done on the pairs identified from the first one. For this sweep we use the particle positions and velocities computed using Keplerian evolution at the beginning and end of the timestep. We use the third order interpolation scheme described in Section 4.4 of Chambers (1999) to predict the minimum separation during the timestep, which we compare to the sum of their Hill radii. Pairs which fail this test are marked. Pairs which approach within a factor α_E times the sum of their Hill radii are marked as undergoing a close encounter. Using a second scan we repack the list of identified pairs into a smaller array.

After all pairs undergoing encounters have been identified, we sort them into non-intersecting sublists. This is done on the CPU as we expect the number of pairs now identified is not large.

We note that since we used the escape velocity in our first sweep to identify pairs of particles undergoing encounters, we could miss encounters between particles escaping from the system and other particles. We suspect the number missed would be small.

The number of pairs identified in the first crude sweep depends on the timestep and the particle density. We could consider other algorithms for removing possible particle pairs from consideration as long as we keep in mind that the first sweep should remove as many pairs as possible while being as efficient as possible.

2.5. List of parameters

We review some parameter definitions.

1. The timestep τ. As the symplectic integrator is second order, the error should depend on τ³.

2. The Hill factor α_H, used to define r_crit (Eq. (12)). This parameter is needed to compute the transition function K and so is needed by the Hermite integrator and to compute interactions when there are close encounters (Eq. (7)). This parameter is a distance in Hill radii of the most massive particle involved in a close encounter.

3. The Hill factor α_E. Particle pairs with minimum estimated approach distances within α_E times the sum of their Hill radii are identified as undergoing close approaches.

4. The smoothing length, ε_H, used in the Hermite integrator. As we do not yet take into account actual collisions, this parameter should be small but not zero. A non-zero smoothing length will prevent extremely small timesteps in the event of a close approach of two point masses.

5. The parameters setting the accuracy of the Hermite integrator, η and η_s (as discussed and defined by Makino and Aarseth (1992)).

6. The number npswitch. If the number of particles involved in an encounter is larger than this number then the Hermite integration is done on the GPU rather than on the CPU.

3. Test integrations

The chaotic nature of the many-body problem makes it challenging to check the accuracy of any code designed to simulate solar or extrasolar systems. There is no simple way to generate analytical solutions for a set of initial conditions. However, it is possible for us to run our integrator through a suite of tests and compare those results to the integrators that form the basis of our code, namely the hybrid symplectic integrator MERCURY by Chambers (1999) and the democratic heliocentric modification to mixed variable symplectic integrators of Duncan et al. (1998) implemented in the SyMBA package.

One of the most basic tests we ran on the integrator was to check that smaller timesteps ensured greater energy conservation for a given set of initial conditions. We use the relative energy error as a metric for measuring the conservation of energy in a simulation. This relative energy error is computed with the formula ΔE = (E − E0)/E0 where E0 is the energy at the beginning of the computation. Indeed, we do observe superior energy conservation with smaller timestep sizes, with the error scaling as O(τ³) as expected since the integrator is second order. Two simple test integrations of 1024 particles and identical initial conditions with timesteps of 0.1 and 0.01 were completed. Over 10 timesteps of 0.1 the relative energy error was ΔE = −8.429 × 10⁻⁶ with a per step average energy error of ΔE = −8.429 × 10⁻⁷. Over 100 timesteps of 0.01 the relative energy error was ΔE = −8.422 × 10⁻⁸ with a per step average energy error of ΔE = −8.422 × 10⁻¹⁰. Comparing the average energy error per step for these two test simulations indicates an exact scaling with τ³.
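As a quick check, the ratio of the quoted per-step errors reproduces the expected third order scaling:

$$\frac{8.429\times10^{-7}}{8.422\times10^{-10}} \approx 1.0\times10^{3} = \left(\frac{0.1}{0.01}\right)^{3}.$$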

3.1. Enhanced or scaled outer solar system

Our primary test integration is an integration of a scaled version of the outer solar system. Both SyMBA and MERCURY were tested in this manner (Duncan et al., 1998; Chambers, 1999). The simulation consists of the four giant planets in the Solar system but with masses increased by a factor of 50. Previous simulations demonstrate that this configuration is unstable (Duncan and Lissauer, 1997), although they disagree on the eventual outcome due to the chaotic nature of solar system evolution. As noted by Chambers (1999), the timestep chosen can have an effect on the eventual outcome even when varied only slightly. With these facts in mind, we ran a simulation with values as close as possible to those of previous tests. To match the simulation described in Section 5.1 of Chambers (1999) we used a timestep of τ = 0.00255 P0 where P0 is the initial orbital period of the innermost planet. We set the Hill factor α_E = 2.5 for encounter detection and α_H = 1.0 setting the transition radius, r_crit. For initial conditions we used epoch J2000.0 orbital elements for the four giant planets (Standish et al., 1992). The energy error for this simulation is shown in Fig. 1. Time is given in orbital periods of the innermost planet. We ran our simulation for the same number of orbital periods as did Chambers (1999) in the test shown in their Fig. 2.

[Fig. 1. We show the relative energy error (E − E0)/E0 from an integration of the outer solar system (Jupiter, Saturn, Uranus, Neptune) only with masses enhanced by a factor of 50. Time is in units of the orbital period in years of the innermost planet. The energy errors are bounded. This behavior is typical of fixed timestep symplectic integrators. There is a planet/planet encounter at a time of about 200 years. Eventually planets are ejected.]

Despite the highly chaotic and unstable nature of this system, our integration is remarkably similar to that by Chambers (1999) (see their Fig. 2). The energy error is bounded, as expected for a symplectic integrator. The spikes in the energy error are also evident with other integrators (see Fig. 2 of Chambers (1999) and Fig. 6 of Duncan et al. (1998)). A close encounter between Jupiter and Saturn occurs approximately 200 years into the integration, which causes a jump in the relative energy error. A similar jump in energy was also seen by Chambers (1999). Our simulation also involved the later ejection of more than one planet.

The energy error of our integration is bounded, typical of a fixed timestep symplectic integrator. However the sizes of the individual spikes in energy error are larger than those shown in Fig. 2 of Chambers (1999), though they are similar in size to those shown in Fig. 6 of Duncan et al. (1998). The spikes in Fig. 2 of Chambers (1999) are of order 10⁻⁶ in fractional error whereas ours are of order 10⁻⁵. There are several possible explanations for the worse performance of our integration. During close approaches we are using a Hermite integrator rather than the Bulirsch–Stoer integrator used by Chambers (1999). However we have measured the error across each close approach and find conservation of H'_Int at a level orders of magnitude below 10⁻⁵, so the choice of integrator for close approaches is unlikely to be the cause. We have checked our drift and Keplerian evolution operators and find they conserve their Hamiltonians within the precision of double precision arithmetic. We find little dependence of the energy error on the form of the transition function, or on the Hill factor α_H setting r_crit. However the Hill factor influencing the identification of close encounters, α_E, does affect the energy error. During encounters, terms in the interaction Hamiltonian (Eq. (4)) can become large. However we add these into the interaction term during the interaction evolution. These terms are then removed subsequently once encounters are identified so that H'_Int is calculated (see the discussion in Section 2.1 and steps 5 and 7 in Section 2.3). This procedure for removing the incorrectly calculated interaction terms could account for the somewhat poorer performance of our integrator.

[Fig. 2. We show the relative energy error in a simulation of the outer solar system (Jupiter, Saturn, Uranus, Neptune). Initial orbital elements are those of the giant planets at epoch J2000.0. Time is given in units of the orbital period in years of the innermost planet.]

Table 1. Profile for 1024 particles.

| Function/kernel | No. of calls | GPU time (μs) | CPU time (μs) | GPU time (%) |
|---|---|---|---|---|
| Interaction | 200 | 790202 | 792607 | 51.40 |
| Hermite | 80 | 537281 | 539429 | 34.95 |
| Sweeps | 300 | 147786 | 151312 | 9.59 |
| Mem. trans. | 2566 | 22567 | 41226 | 1.46 |
| Keplerian | 100 | 17116 | 18308 | 1.11 |
| Energy | 3 | 11227 | 11287 | 0.73 |
| Drift | 200 | 1770 | 4007 | 0.11 |
| Various | 1521 | 9217 | 27259 | 0.57 |

This table shows the fraction of time spent in different tasks for an integration of 1024 particles integrated for 100 timesteps on an NVIDIA GTX 285. The leftmost column lists the kernels. 'Mem. trans.' denotes the time involved in memory transfers.


3.2. Long term outer Solar system evolution

For our second integration we compare the relative energy error of the evolution of the outer Solar system with much larger timesteps and for a much larger amount of time, similar to the test integration discussed by Duncan et al. (1998) and shown in their Fig. 2. In this test, we again use the epoch J2000.0 orbital elements as initial conditions but use a timestep of τ = 0.0318 P0 where P0 is the initial orbital period of the innermost planet. The timestep used and the total integration time (~3 × 10⁵ yr) are similar to those values used by Duncan et al. (1998). The masses of the planets are unchanged from their accepted values. This configuration is known to be stable so we did not expect any close encounters or ejections, and we expected a somewhat better relative energy error compared to the last simulation despite the increase in timestep size. As in the previous test, we did not observe a relative energy error as good as Duncan et al. (1998); our accuracy is a factor of a few poorer in terms of both the average energy error and the size of the fluctuations in the energy error. No encounters are present in this simulation, suggesting that we have somewhat larger sources of error in our interaction or Keplerian evolution steps than SyMBA. We are not yet sure what is causing this low level of error as these computations have been done in double precision and, when tested individually for single timesteps, we have found them accurate.

3.3. Sensitivity of parameters

We find that the energy error is insensitive to the Hill factor α_H setting the transition function but is quite sensitive to α_E, the Hill factor for encounter detection. We find the best numerical results for α_E in the range 1 to 4. The larger the value of α_E, the more encounters are sent to the Hermite integrator and the slower the integration. However, if α_E is too small then the difference between the integrated Hamiltonian and the true one will be large as the interaction terms become large.

4. Benchmarks and profiling

In this section we discuss the fraction of runtime spent doing each computation in some sample integrations. Computations that are run on the GPU are called kernels. In Tables 1 and 2 we list the fraction of GPU time spent in each kernel or group of kernels, or doing memory operations, for two different simulations. We also list the CPU runtime for each kernel, which includes the overhead for calling the device function in addition to the runtime on the GPU. We label the kernels or groups of kernels as follows: Interaction (evolving using H_Int), Keplerian (evolving using H_Kep), Sweeps (the kernels for finding close encounters), Drift (evolving using H_Dft), Energy (evaluating the total energy on the GPU, only done once per data output), and Hermite (when the Hermite integrator is run on the GPU). Also listed is the total time spent doing memory allocation and transfers (listed under 'Mem. trans.'). All other kernels are listed under 'Various' and their computation times are summed. Kernels in the 'Various' category include center of momentum calculations, scans, repacking, etc. that are not part of the previously listed kernels. Three sweeps are called, the first two passing over all particle pairs. The sum of the time spent in all sweeps is shown in the tables under Sweeps.


7 GPU time in % listed by the CUDA profiler does not always add up to 100 for reasons of significant figures.

Table 2. Profile for 10240 particles.

| Function/kernel | No. of calls | GPU time (μs) | CPU time (μs) | GPU time (%) |
|---|---|---|---|---|
| Interaction | 200 | 2.29 × 10⁷ | 2.29 × 10⁷ | 53.21 |
| Hermite | 80 | 1.60 × 10⁷ | 1.60 × 10⁷ | 37.20 |
| Sweeps | 300 | 3.58 × 10⁶ | 3.59 × 10⁶ | 8.30 |
| Energy | 3 | 302848 | 302920 | 0.70 |
| Mem. trans. | 2566 | 171014 | 316417 | 0.39 |
| Keplerian | 100 | 48186 | 49397 | 0.11 |
| Drift | 200 | 8183 | 10427 | 0.01 |
| Various | 1721 | 15529 | 35350 | 0.02 |

Similar to Table 1. This profile is for 10240 particles integrated for 100 timesteps on an NVIDIA GTX 285.

Table 3. Comparing kernel speeds for 10³, 10⁴ and 10⁵ particles.

| Kernel | No. particles | No. of calls | Time (μs) | GPU time (%) |
|---|---|---|---|---|
| Interaction | 102400 | 20 | 1.965 × 10⁸ | 77.26 |
| | 10240 | | 2.286 × 10⁶ | 76.91 |
| | 1024 | | 79019 | 71.84 |
| Sweeps | 102400 | 30 | 3.071 × 10⁷ | 12.06 |
| | 10240 | | 356295 | 11.98 |
| | 1024 | | 14776 | 13.41 |
| Energy | 102400 | 3 | 2.689 × 10⁷ | 10.59 |
| | 10240 | | 307708 | 10.35 |
| | 1024 | | 11210 | 10.19 |
| Various | 102400 | 507 | 234894 | 0.08 |
| | 10240 | 477 | 21849 | 0.70 |
| | 1024 | 457 | 4986 | 4.51 |

This profile shows something akin to the asymptotic limit of simulations which do not have objects experiencing close approaches. Each simulation is of a single output of 10 timesteps on a GTX 285 with the specified number of particles.


The simulations described in Tables 1 and 2 are identical except in the number of particles in the simulation. The initial conditions consist of 4 massive planets inside a debris disk which is significantly less massive than the planets and is truncated relatively quickly. Both simulations were run for 1 output of 100 timesteps. We note that the energy kernel is only called once per output. The set of initial conditions was chosen so that the system would quickly have close encounters involving two planets and so force the integrator to call the Hermite integrator with all particles. This allowed us to measure the performance of different kernels. Relevant profiler information includes the number of kernel calls made to a particular function, the total time to complete all calls of that function on both the CPU and GPU, and the percentage of GPU runtime spent running each kernel. Tables 1 and 2 represent a scenario in which collisions are frequent and the Hermite integrator is called frequently.

Profiling for the code was done with the NVIDIA CUDA Visual Profiler version 3.0 that is available with the CUDA toolkit. Profiling was done on two video cards, the NVIDIA GeForce GTX 285 and the GeForce GTX 480. Tables 1 and 2 only show the times for computations on the GPU alone and the CPU overhead + GPU time to call these functions, but do not show the fraction of total computation time on the GPU. However, using the GPU Utilization Plot available in the Visual Profiler, we are able to determine the session level (an entire integration) GPU utilization. For 10³ particles we achieved a utilization of 93% and for 10⁴ particles a utilization of 85%. For 10⁵ particles, utilization varied dramatically, but never dipped below ~50%. The overall utilization dropping for higher numbers of particles may seem counterintuitive, but makes sense in light of the increasing number of particles that are flagged for close approaches and sent to the CPU for integration. In fact, in certain cases, particularly with low densities or small Hill factor identification radii, 10⁵ particle simulations would display very high utilization. It is important to note that these utilization percentages represent the time that the GPU is not idle; they are not indicative of the actual performance of a particular kernel or the code as a whole. There are other more useful metrics for individual kernel performance. Also, we note that the Hermite integrator is the only major routine in the code that is run on the CPU and because of this the host processor speed and quantity of main system memory have little effect on the values given in the tables. The effect of higher CPU speeds is to decrease the total runtime and to affect the GPU utilization values. Only in simulations with lower GPU utilization percentages do the effects of increased processor speed become apparent. Altering main system memory amounts or speeds has no practical effect on our runtime due to the relatively small amount of memory being used, even for 10⁵ particle simulations.

We observed that nearly all of the runtime is dominated by the interaction step, the Hermite integrations and the close approach detection kernels labeled "Sweeps" in the tables. Memory operations, the drift and Keplerian evolution steps of the symplectic integrator and all of the other various functions on the GPU add up to only a small fraction of the runtime. This percentage will continue to decrease as the particle number is increased, as the interaction step is O(N²) but the Keplerian evolution and drift step are O(N). The ratio of time spent doing sweeps compared to that in the interaction step may increase with N as there may be more encounters when the particle density is higher. Comparing Tables 1 and 2 we see that the fraction of time spent in the interaction step is only somewhat larger when the number of particles is larger. This implies that the fraction of time spent doing operations that are O(N) is small even when the number of particles is only 1024.

Examining the timing information from the simulations with 10240 particles, we note that on average a single call to the Hermite and interaction kernels took about 0.20 and 0.11 s respectively. We also observe that the CPU + GPU runtimes are nearly identical to the GPU runtime alone for all major kernels. It is only for smaller numbers of particles or for kernels that are O(N) that these values diverge even slightly.

In Table 3 we show a comparison of simulations with three different particle numbers and initial conditions identical to those of the first two sets of simulations described in the first two tables. However, these were evolved for only 10 timesteps instead of 100 before outputting data. These integrations lack close encounters and so serve to compare the sweep, energy and interaction kernels. For simulations with sparse debris disks or low mass objects, this third table describes how the GPU runtime will be spent. Clearly, the interaction step, energy calculation and sweeps compose a majority of the runtime. As larger numbers of timesteps are taken per data output, or if we suppress the calculation of the energy, we will reach an asymptotic limit of 85% of GPU time spent on the interaction step with the remaining 15% of the GPU time spent on the sweeps for encounter detection.7

4.1. Optimizations

We have optimized our code beyond the standard CPU software optimization techniques by using some of the "best practices" for GPU programming. Refer to the "Best Practices Guide – CUDA 3.0" for an in depth description of low, medium and high priority optimizations. The most important and most obvious best practice is to run as much of your code on the GPU as possible while simultaneously implementing each kernel to take advantage of as much parallelism as possible. To this end, we have put nearly all of the computations for our integrator on the GPU including all the evolution operators in our symplectic integrator (drift, Keplerian, interaction), all collision detection sweeps, the Hermite integration routine as well as the energy computation. As shown in the profiling, the execution time of a non-interacting system is dominated by the O(N²) interaction term, a kernel with a high degree of arithmetic intensity and minimal memory transfer compared to the number of floating point operations. Some of these routines show small GPU performance benefits over the CPU because of their large size in terms of registers, lack of arithmetic intensity in that they are O(N), and their inability to use shared memory. However, even these serial routines are executed on the GPU to ensure as few host to GPU or GPU to host memory transfers as possible. This best practice is very important because it ensures fewer high latency memory transfers between the system memory and global memory on the GPU and again from global memory to the multiprocessors. The more time that can be spent doing floating point operations rather than memory transfers, the greater the GPU speed increase. Other high priority optimizations are to access shared memory over global memory whenever possible and to keep your kernels from having diverging execution paths. Some routines were not or could not be written to use shared memory, but all O(N²) sweeps and the interaction step leverage this tremendous performance enhancer. All kernels are written to minimize branching statements to ensure that execution paths do not diverge. This best practice manifests itself most obviously in the interaction detection and Hermite integration kernels. Collision detection was not incorporated in the same function, but rather in a separate kernel. In another example of a code design choice, we over count and do N² computations in the interaction step instead of N(N − 1)/2 since it allows for simpler code that is more easily executed in parallel on the GPU.

Medium priority optimizations, such as use of the fast math library, that could potentially affect the accuracy of our code in a negative way are not implemented. We do use multiples of 32 threads for each block and we are able to attain a 33% occupancy rate for the interaction step, the energy computation and the GPU version of the Hermite gravity step, as well as a 50% occupancy rate for the first and second sweep steps on a GF100 based GTX 480. The concept of occupancy is a complicated one and is explained in more detail in both the "Best Practices Guide – CUDA 3.0" and the "CUDA Programming Guide 3.0". At its core it is simply the ratio of the number of active warps on a multiprocessor to the maximum number of warps the multiprocessor can maintain. A warp is simply a group of 32 threads to be executed on a multiprocessor. While computing the number of active warps per multiprocessor involves hardware knowledge beyond this document, and noting that higher occupancy does not necessarily equate to better performance, it is important to maintain a minimum occupancy. Below some value, the latencies involved with launching warps can not be hidden. The "Best Practices Guide – CUDA 3.0" suggests that an occupancy of at least 25% be maintained. As mentioned, we are able to attain this.

We also attempt to ensure optimal usage of register space per block and to unroll loops in functions when doing so can increase performance. The number of loop unrolls that can be achieved is determined by examining the number of registers used per thread for a given function and multiplying it by the number of threads per block to be issued. This gives the total number of registers used per multiprocessor, the group of CUDA cores to which a block is issued. This value must be smaller than the number of registers available on the relevant architecture's multiprocessor. Increasing the number of loop unrolls in a given kernel can alter the register requirements, so there is a limit to the number of loop unrolls that can be implemented. For reference, on the GT200 architecture (GTX 285) and the GF100 architecture (GTX 480) there are 8 and 32 CUDA cores per multiprocessor, 16 K and 32 K 32-bit registers per multiprocessor, and 16 KB and 48 KB of shared memory per multiprocessor, respectively.8 All relevant hardware numbers, including number of threads, multiprocessors, cores per multiprocessor, amount of shared memory, etc., can be found in appendices A and G of the CUDA Programming Guide. It is possible to artificially restrict a function to use fewer registers than it requires. However, this forces some values to be stored in local memory (global memory), which, as mentioned previously, carries a very heavy performance penalty in terms of latency. For this reason, it can be detrimental to limit the number of registers per thread below that which is required. We found that our performance was best with a loop unroll of 4, a maximum of 64 registers per thread, and 128 threads per block.
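
The unroll factor and register cap quoted above are expressed through standard CUDA mechanisms; whether the code uses the global compiler flag or per-kernel launch bounds is not stated here, so the fragment below is purely illustrative and its kernel body and name are invented.

// Illustrative only: how an unroll factor and a register cap are expressed.
// Compile with, e.g.:  nvcc -arch=sm_20 --maxrregcount=64 kernel.cu
// or cap a single kernel with __launch_bounds__ as shown.

__global__ void __launch_bounds__(128)        // 128 threads per block assumed at launch
drift_sketch(int n, double dt, double *x, const double *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    #pragma unroll 4                          // unroll factor of 4, as used above
    for (int k = i; k < n; k += stride)
        x[k] += v[k] * dt;                    // trivial per-particle drift, for illustration
}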

4.2. Use of parallel primitives

The parallel computations used by this code for the most part utilize parallel primitives and so can be ported to other parallel computation platforms with similar parallel primitive libraries. The energy computation, interaction step, and all-pairs sweep identification routines essentially utilize the same tiled shared memory algorithm (described by Nyland et al. (2008)). The drift step and encounter detection use a parallel prefix sum (see Harris et al., 2008).
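
As a sketch of how a parallel prefix sum turns per-particle encounter flags into a compact list of indices, the fragment below uses the Thrust scan only for brevity; the code's own scan follows Harris et al. (2008), and the function names and flag layout here are assumptions.

// Sketch: stream compaction of encounter flags with a parallel prefix sum.
#include <thrust/device_vector.h>
#include <thrust/scan.h>

__global__ void gather_indices(int n, const int *flag, const int *offset, int *list)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flag[i])
        list[offset[i]] = i;   // offset[] is the exclusive prefix sum of flag[]
}

// Host side: flag[i] = 1 if particle i is in an encounter, 0 otherwise.
void compact_encounters(const thrust::device_vector<int> &flag,
                        thrust::device_vector<int> &list, int &count)
{
    int n = flag.size();
    thrust::device_vector<int> offset(n);
    thrust::exclusive_scan(flag.begin(), flag.end(), offset.begin());
    count = (n > 0) ? offset[n - 1] + flag[n - 1] : 0;   // total flagged particles
    list.resize(count);
    gather_indices<<<(n + 127) / 128, 128>>>(
        n,
        thrust::raw_pointer_cast(flag.data()),
        thrust::raw_pointer_cast(offset.data()),
        thrust::raw_pointer_cast(list.data()));
}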

4.3. Areas for future optimization

There are several areas in which our code could be further optimized. Without drastically altering the code there are some ‘‘low priority’’ optimizations, such as the use of constant memory for unchanging values like smoothing lengths and the Hill factors.
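
A minimal sketch of the constant-memory idea follows; the variable names and the kernel are illustrative, not taken from the code.

// Broadcast run-wide parameters from __constant__ memory instead of global memory.
#include <cuda_runtime.h>

__constant__ double c_eps2;        // softening length squared (illustrative)
__constant__ double c_hill_factor; // multiple of the Hill radius used for encounters

void upload_run_constants(double eps2, double hill_factor)
{
    cudaMemcpyToSymbol(c_eps2, &eps2, sizeof(double));
    cudaMemcpyToSymbol(c_hill_factor, &hill_factor, sizeof(double));
}

__global__ void flag_close_pairs_sketch(int n, const double *r2, int *flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // constant reads are cached and broadcast to all threads of a warp
        flag[i] = (r2[i] < c_hill_factor * c_hill_factor);
}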

More important hardware level optimizations could include alternate ways to ensure global memory coalescing and to prevent shared memory bank conflicts, either by padding our current data structures or by disassembling them entirely and using single arrays of double precision elements to store data.
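
These two layout ideas can be illustrated with a generic stand-in rather than the integrator's own data structures: a structure-of-arrays particle layout for coalesced global loads, and the textbook padded-tile transpose for conflict-free shared memory access. Every name below is illustrative.

// (1) Structure-of-arrays: neighbouring threads read neighbouring doubles, so
//     global loads coalesce; an array of large particle structs would not.
struct ParticlesSoA {
    double *x, *y, *z;     // positions
    double *vx, *vy, *vz;  // velocities
    double *m;             // masses
};

// (2) Padded shared memory tile: the extra column staggers column accesses
//     across banks. Stand-in example: transpose of a square n x n array,
//     with n assumed to be a multiple of TILE.
#define TILE 16
__global__ void transpose_padded(double *out, const double *in, int n)
{
    __shared__ double tile[TILE][TILE + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;              // swap block indices for the write
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store, conflict-free read
}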

On a higher level, there are several optimizations that could possibly speed up the code significantly. First, replacing the interaction step, which is currently an all-pairs O(N2) calculation, with a GPU enhanced tree integration (e.g., Richardson et al., 2000; Gaburov et al., in press) could bring a speed increase to the code and allow larger numbers of particles to be simulated.

Additionally, the sorting routines (‘sweep kernels’) could be optimized to use faster detection methods than the current O(N2) methods. By implementing a parallel sorting algorithm or simply making our detection routine more intelligent, we could reduce this part of the runtime. For the simulation of many massless but colliding particles (such as dust particles) in the vicinity of planets and planetesimals, an improvement in the encounter identification may allow us to integrate many more particles.

A potentially simpler change would be the implementation of double precision in software rather than hardware. Double precision code on NVIDIA GPUs executes more slowly than single precision, as is often the case in hardware. Depending on the device, this factor can be anywhere from 1/8th to 1/2 the speed, and some older CUDA capable devices do not support double precision at all. By implementing double precision in software we would be able to compile our code using 32-bit floating point precision. This would have the benefit of multiplying the speed of the code by a factor of at least a few and could possibly allow devices not originally capable of supporting the code to run it.

Additionally, it appears that only GF100 devices in the Tesla brand of cards will support the full double precision rate of 1/2 the single precision flop rate. A penalty of 1/8th the single precision flop rate was observed while profiling our code on a GTX 480. This was observed indirectly: we noticed only a 2× speed increase on our GF100 based machine over our GT200 based GPUs for our double precision kernels such as the interaction step and the energy computation. At 1/8th the peak double precision flops, the GF100 based GPU should be 2× faster than a GT200 based GPU because it has 2× the number of cores of the GT200 based cards (ignoring minor execution time differences due to the actual core clock speeds); the GT200 based GPUs are known to have a double precision flop peak 1/8th that of their single precision peak. In practice, it should be noted that the GF100 cards run faster than 2× the speed of the GT200 GPUs due to other architectural optimizations between the cards. In particular, small kernels which have not been or do not parallelize as well run much faster on the GF100 GPUs. Lastly, in addition to our code supporting more devices and running at 2× the speed, the code would be easier to optimize to ensure global memory coalescing and to prevent shared memory bank conflicts due to the 4 byte size of single precision floating point values.

However, software implementation of double precision has downsides. Notably, it is often difficult to achieve the same precision as double precision floating point when using two single precision floating point values. Additionally, full conversion of the code to use single precision floating point values and redefining basic vector operations would involve extensive reworking of the kernels that could lead to dozens of extra operations in each particular kernel. It is not obvious that this sort of optimization is entirely suited to the GPU, because increasing the number of operations in a kernel can increase register usage, an effect we are trying to avoid for optimization reasons. Finally, simply modifying the distance calculation to be in double precision and computing the rest of the acceleration step in single precision would result in the largest performance benefit with the least amount of coding required, but could potentially lose a large amount of precision in our calculations.
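
The usual way to emulate higher precision with pairs of floats is the Knuth/Dekker "two-sum" construction. The sketch below shows a float-float addition under the assumption that the compiler is not allowed to reassociate or contract the single precision adds (no fast-math options); it is illustrative, not code from the integrator.

// "Double-single" (float-float) addition: each value is an unevaluated sum hi + lo.
struct dsfloat { float hi, lo; };

__host__ __device__ inline dsfloat ds_add(dsfloat a, dsfloat b)
{
    // Two-sum of the high words: s + e equals a.hi + b.hi exactly.
    float s = a.hi + b.hi;
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v);

    // Fold in the low words and renormalize.
    e += a.lo + b.lo;
    dsfloat r;
    r.hi = s + e;
    r.lo = e - (r.hi - s);
    return r;
}

Roughly ten single precision operations replace one double precision addition, which is why the payoff depends strongly on a device's single-to-double throughput ratio.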

Due to the previously mentioned register restrictions and the complexity of our code, it is often difficult to achieve high occupancy on the devices we use. It may be possible to rewrite entire routines in non-obvious ways to reduce the maximum number of registers in use at a given point in time and so remedy this problem. This is only a ‘‘medium priority’’ optimization and is probably the most difficult one due to the number of ways kernels can be rewritten. This optimization is also hardware dependent and so may not be worthwhile.

5. Summary and conclusion

We have described an implementation of a hybrid second order symplectic integrator for planetary system integration that permits close approaches. It is similar in design to SyMBA (Duncan et al., 1998) and MERCURY (Chambers, 1999) but is written in CUDA and works in parallel on a GPU. The code is almost as accurate as the older integrators but is faster when many particles are simultaneously integrated. Bounded energy errors are observed during numerical integration of a few test cases, implying that our integrator is indeed nearly symplectic. The code has been written primarily with parallel primitives so that modified versions can be written for other parallel computation platforms. The current version of the code does not take into account collisions between particles; however, we plan to modify future versions so that these can be incorporated into the code. We also plan to modify the encounter algorithms so that many dust particles can be integrated in the vicinity of planets and planetesimals.

While slightly larger errors in energy conservation are present in our integrator as compared to SyMBA and MERCURY, a great deal of interesting planetesimal-planetesimal and planet-planetesimal dynamics can be observed in a relatively small amount of time, as compared to an integration examining the dynamics or stability of multiple planet (planet-planet) systems. Additionally, the main benefit of using this integrator, namely that the dynamics of all-pairs force computations can be observed, requires a reasonable number of particles in order to have the resolution necessary to observe interesting phenomena. Due to the O(N2) nature of a few of the algorithms in our integrator, this effectively precludes running simulations on solar system timescales, regardless of the integrator's relative energy accuracy. We therefore believe that the energy accuracy of our integrator is sufficient for the relatively short (compared to solar-system length integrations) dynamical simulations at which it is adept.

Acknowledgments

Support for this work was provided by NSF through award AST-0907841. We thank Richard Edgar for help with the design and initial setup of our GPU cluster. We thank NVIDIA for the gift of four Quadro FX 5800 and two GeForce GTX 280 video cards. We thank Nicholas Moore for informative discussions regarding GPU architectural and coding minutiae.

Appendix A. f and g Functions

Given a particle with position x0 and velocity v0 at time t0 in a Keplerian orbit, its new position, x, and velocity, v, at time t can be computed as

$$
x = f\,x_0 + g\,v_0, \qquad
v = \dot f\,x_0 + \dot g\,v_0
\tag{A.1}
$$

in terms of the f and g functions and their time derivatives, ḟ and ġ. Introductory celestial mechanics textbooks often discuss f and g functions solely for particles in elliptic orbits. However, if a particle in a hyperbolic orbit is advanced using elliptic coordinates, a NaN will be computed that can propagate via the interaction steps. It is desirable to advance particle positions for all possible orbits, including parabolic and hyperbolic ones. This can be done by computing the f and g functions with universal variables, as described by Prussing and Conway (1993). The recipe is repeated here as it is useful but not available in most textbooks.

With μ ≡ √(GM), α ≡ 1/a where a is the semi-major axis, and r0 the initial radius, the f and g functions and their time derivatives in universal variables are computed as

$$
\begin{aligned}
f &= 1 - \frac{\chi^2}{r_0}\,C(\alpha\chi^2), \\
g &= (t - t_0) - \frac{\chi^3}{\sqrt{\mu}}\,S(\alpha\chi^2), \\
\dot f &= \frac{\sqrt{\mu}\,\chi}{r\,r_0}\left[\alpha\chi^2 S(\alpha\chi^2) - 1\right], \\
\dot g &= 1 - \frac{\chi^2}{r}\,C(\alpha\chi^2)
\end{aligned}
\tag{A.2}
$$

(see Eqs. 2.38 by Prussing and Conway (1993)). Here the variable χ solves the universal Kepler equation, in which r0 · v0 denotes the dot product of the initial position and velocity vectors:

$$
\sqrt{\mu}\,(t - t_0) = \frac{(r_0\cdot v_0)\,\chi^2}{\sqrt{\mu}}\,C(\alpha\chi^2)
 + (1 - r_0\alpha)\,\chi^3 S(\alpha\chi^2) + r_0\,\chi
\tag{A.3}
$$

(Eq. 2.39 Prussing and Conway, 1993). The f and g functions must be computed before the time derivatives, ḟ and ġ, so that r, the radius at time t, can be computed. This radius is then used to compute ḟ and ġ.

The needed transcendental functions are

$$
C(y) =
\begin{cases}
\dfrac{1}{2!} - \dfrac{y}{4!} + \dfrac{y^2}{6!} - \cdots & \text{if } y \simeq 0,\\[6pt]
\dfrac{1 - \cos\sqrt{y}}{y} & \text{if } y > 0,\\[6pt]
\dfrac{\cosh\sqrt{-y} - 1}{-y} & \text{if } y < 0,
\end{cases}
$$

and

$$
S(y) =
\begin{cases}
\dfrac{1}{3!} - \dfrac{y}{5!} + \dfrac{y^2}{7!} - \cdots & \text{if } y \simeq 0,\\[6pt]
\dfrac{\sqrt{y} - \sin\sqrt{y}}{\sqrt{y^3}} & \text{if } y > 0,\\[6pt]
\dfrac{\sinh\sqrt{-y} - \sqrt{-y}}{\sqrt{(-y)^3}} & \text{if } y < 0
\end{cases}
\tag{A.4}
$$

(Eq. 2.40 by Prussing and Conway (1993)). It is convenient to define the function F(χ) and compute its derivatives:

$$
\begin{aligned}
F(\chi) &= -\sqrt{\mu}\,(t - t_0) + \frac{(r_0\cdot v_0)\,\chi^2}{\sqrt{\mu}}\,C(\alpha\chi^2)
          + (1 - r_0\alpha)\,\chi^3 S(\alpha\chi^2) + r_0\,\chi, \\
F'(\chi) &= \frac{(r_0\cdot v_0)\,\chi}{\sqrt{\mu}}\left[1 - \alpha\chi^2 S(\alpha\chi^2)\right]
          + (1 - r_0\alpha)\,\chi^2 C(\alpha\chi^2) + r_0, \\
F''(\chi) &= \frac{(r_0\cdot v_0)}{\sqrt{\mu}}\left[1 - \alpha\chi^2 C(\alpha\chi^2)\right]
          + (1 - r_0\alpha)\,\chi\left[1 - \alpha\chi^2 S(\alpha\chi^2)\right]
\end{aligned}
\tag{A.5}
$$

(Eqs. 2.41, 2.42 and problem 2.17 by Prussing and Conway (1993)). To solve the universal Kepler equation (Eq. (A.3), which can now be written as F(χ) = 0), the Laguerre algorithm can be applied iteratively (Conway, 1986) as

$$
\chi_{i+1} = \chi_i - \frac{n\,F(\chi_i)}
{F'(\chi_i) \pm \left| (n-1)^2 F'(\chi_i)^2 - n(n-1)\,F(\chi_i)\,F''(\chi_i) \right|^{1/2}}
\tag{A.6}
$$

(Eq. 2.43 Prussing and Conway, 1993). Good numerical performance is found with n = 5 (Conway, 1986). The sign in the denominator is chosen to be the same as the sign of F′(χᵢ). The rate of convergence is cubic, and convergence is achieved for any starting value of χ (Conway, 1986). Using a starting value χ₀ = r₀, we achieved convergence to a level of 10⁻¹⁶ in under 6 iterations for a wide range of timesteps and orbital parameters. Improved choices for the starting value of χ are discussed by Prussing and Conway (1993) and Conway (1986), but these require more computation than χ₀ = r₀.
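
For concreteness, a serial sketch of the whole recipe follows, in the same C for CUDA style as the rest of the code. The function names, the series switchover threshold, the convergence tolerance, and the fixed iteration cap are illustrative choices, not taken from the QYMSYM source.

// Universal-variable Kepler step after Prussing & Conway (1993), Eqs. 2.38-2.43,
// with the n = 5 Laguerre iteration of Conway (1986).
#include <math.h>

__host__ __device__ static double stumpff_C(double y)
{
    if (y > 1e-8)  return (1.0 - cos(sqrt(y))) / y;
    if (y < -1e-8) return (cosh(sqrt(-y)) - 1.0) / (-y);
    return 1.0 / 2.0 - y / 24.0 + y * y / 720.0;            // series near y = 0
}

__host__ __device__ static double stumpff_S(double y)
{
    if (y > 1e-8)  { double s = sqrt(y);  return (s - sin(s)) / (y * s); }
    if (y < -1e-8) { double s = sqrt(-y); return (sinh(s) - s) / (-y * s); }
    return 1.0 / 6.0 - y / 120.0 + y * y / 5040.0;          // series near y = 0
}

// Advance (x0, v0) by dt along a Keplerian orbit with gravitational parameter mu = GM.
__host__ __device__ void kepler_step(double mu, double dt,
                                     const double x0[3], const double v0[3],
                                     double x[3], double v[3])
{
    double r0    = sqrt(x0[0]*x0[0] + x0[1]*x0[1] + x0[2]*x0[2]);
    double rv0   = x0[0]*v0[0] + x0[1]*v0[1] + x0[2]*v0[2];  // r0 . v0
    double v02   = v0[0]*v0[0] + v0[1]*v0[1] + v0[2]*v0[2];
    double sqmu  = sqrt(mu);
    double alpha = 2.0 / r0 - v02 / mu;                      // 1/a, valid for any conic

    // Laguerre iteration (n = 5) on F(chi) = 0, starting from chi = r0.
    const double n = 5.0;
    double chi = r0;
    for (int it = 0; it < 20; ++it) {
        double y  = alpha * chi * chi;
        double C  = stumpff_C(y), S = stumpff_S(y);
        double F  = -sqmu * dt + (rv0 / sqmu) * chi * chi * C
                    + (1.0 - r0 * alpha) * chi * chi * chi * S + r0 * chi;
        double Fp = (rv0 / sqmu) * chi * (1.0 - y * S)
                    + (1.0 - r0 * alpha) * chi * chi * C + r0;
        double Fpp = (rv0 / sqmu) * (1.0 - y * C)
                     + (1.0 - r0 * alpha) * chi * (1.0 - y * S);
        double disc  = fabs((n - 1.0) * (n - 1.0) * Fp * Fp
                            - n * (n - 1.0) * F * Fpp);
        double denom = Fp + copysign(sqrt(disc), Fp);        // sign follows F'
        double dchi  = n * F / denom;
        chi -= dchi;
        if (fabs(dchi) < 1e-14 * fabs(chi)) break;
    }

    // f and g functions and their time derivatives (Eq. A.2).
    double y = alpha * chi * chi;
    double C = stumpff_C(y), S = stumpff_S(y);
    double f = 1.0 - (chi * chi / r0) * C;
    double g = dt - (chi * chi * chi / sqmu) * S;
    for (int k = 0; k < 3; ++k) x[k] = f * x0[k] + g * v0[k];

    double r  = sqrt(x[0]*x[0] + x[1]*x[1] + x[2]*x[2]);     // radius at time t
    double fd = (sqmu / (r * r0)) * chi * (y * S - 1.0);
    double gd = 1.0 - (chi * chi / r) * C;
    for (int k = 0; k < 3; ++k) v[k] = fd * x0[k] + gd * v0[k];
}

Each particle's Kepler advance is independent of the others, so a routine of this form maps naturally onto one thread per particle.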

References

Capuzzo-Dolcetta, R., Mastrobuono-Battisti, A., Maschietti, D., 2011. New Astronomy 16, 284.
Chambers, J.E., 1999. MNRAS 304, 793.
Conway, B.A., 1986. Celestial Mechanics 39, 199.
Duncan, M.J., Levison, H.F., Lee, M.H., 1998. AJ 116, 2067.
Duncan, M.J., Lissauer, J.J., 1997. Icarus 125, 1.
Gaburov, E., Harfst, S., Portegies Zwart, S., 2009. New Astron. 14, 630.
Gaburov, E., Bédorf, J., Portegies Zwart, S., 2010. Procedia Computer Science 1, 1119.
Harris, M., Sengupta, S., Owens, J.D., 2008. In: Nguyen, H. (Ed.), GPU Gems 3. Addison-Wesley, Upper Saddle River, NJ, p. 851 (Chap. 39).
Leimkuhler, B., Reich, S., 2004. Simulating Hamiltonian Dynamics. Cambridge University Press, Cambridge, UK.
Makino, J., Fukushige, T., Koga, M., Namura, K., 2003. PASJ 55, 1163.
Makino, J., Aarseth, S.J., 1992. PASJ 44, 141.
Moore, A., Quillen, A.C., Edgar, R.G., 2008. arXiv:0809.2855.
Morbidelli, A., 2001. Ann. Rev. Earth Pl. Sci. 30, 89.
Nyland, L., Harris, M., Prins, J., 2008. In: Nguyen, H. (Ed.), GPU Gems 3. Addison-Wesley, Upper Saddle River, NJ, p. 677 (Chap. 31).
Prussing, J.E., Conway, B.A., 1993. Orbital Mechanics. Oxford University Press, New York, NY.
Richardson, D.C., Quinn, T., Stadel, J., Lake, G., 2000. Icarus 143, 45.
Standish, E.M., Newhall, X.X., Williams, J.G., Yeomans, D.K., 1992. Orbital Ephemerides of the Sun, Moon, and Planets. In: Seidelmann, P.K. (Ed.), Explanatory Supplement to the Astronomical Almanac. University Science Books, Mill Valley, CA.
Wisdom, J., Holman, M., Touma, J., 1996. Fields Inst. Commun. 10, 217.
Wisdom, J., Holman, M., 1991. AJ 102, 1528.
Yoshida, H., 1990. Phys. Lett. A 150, 262.
Yoshida, H., 1993. Celest. Mech. 56, 27.