
Supercomputing for Molecular Dynamics Simulations: Handling Multi-Trillion Particles in Nanofluidics

Alexander Heinecke¹, Wolfgang Eckhardt², Martin Horsch³, and Hans-Joachim Bungartz²

¹ Intel Corporation, 2200 Mission College Blvd., Santa Clara 95054, CA, USA
² Technische Universität München, Boltzmannstr. 3, D-85748 Garching, Germany
³ University of Kaiserslautern, Erwin-Schrödinger-Str. 44, D-67663 Kaiserslautern, Germany


Table of Contents

1 Introduction
1.1 The Art of Molecular Modeling and Simulation
1.2 Focus and Structure of this Work
2 Molecular Dynamics Simulation
2.1 Molecular Models and Potentials
2.2 Statistical Ensembles
2.3 Algorithms for Nearest-Neighbor Search
2.4 Characteristics of Large-Scale Molecular Dynamics Simulation of Fluids
2.5 Simulation Code Mardyn
3 Parallelization of MD Algorithms and Load Balancing
3.1 Target Systems
3.2 Shared Memory Parallelization
3.3 Spatial Domain Decomposition
3.4 Load Balancing Based on KD Trees
4 Efficient Implementation of the Force Calculation in MD Simulations
4.1 Memory Access Optimizations
4.2 Vectorization
4.3 Optimization Possibilities for Monatomic Fluids
5 Experiments
5.1 Performance on SuperMUC
5.2 Performance on the Intel Xeon Phi coprocessor
5.3 Multi-Trillion Particle Simulations
5.4 Summary
6 Conclusion


1 Introduction

Since the pioneering simulations by Alder and Wainwright in 1959 [2] and Rahman in 1964 [74], the field of molecular dynamics (MD) simulation has undergone a remarkable development. While early simulations concentrated on simple hard-sphere systems and systems of monatomic model fluids, real-world substances can be directly studied by computer simulation today. Significant progress has been made in the construction of molecular models, so that complex molecular systems can now be investigated reliably. A large number of force fields allows the treatment of a wide range of fluids in many applications. Different simulation techniques such as classical molecular dynamics or Monte Carlo (MC) simulation are well understood, as are the mathematical equations and numerical methods to solve them. Computational progress, made possible through both hardware and algorithmic development, is reflected in the particle numbers and time spans covered by contemporary simulation runs. While simulations were initially limited to systems of a few thousand molecules that could be simulated for picoseconds only, the simulation of much larger systems with a total simulation time on the order of milliseconds [90] is now within reach. Despite that progress, MD is not yet a universal tool. The level of development regarding its readiness for straightforward application highly depends on the field.

No field is better suited than molecular biology to demonstrate the progress in MD. Here, molecular simulation has a long history and has become a standard tool complementing the two classical pillars of science, theory and experiment. Simulation techniques are used by default to identify and study interesting molecular configurations, greatly reducing the number of necessary experiments. Molecular docking [50], e.g., is an important simulation step in drug design, preceding compound synthesis, and investigates how small molecules attach to the active site of macromolecules such as a virus. This knowledge is useful to activate or inhibit enzymes, so that bio-molecular processes can be enabled or suppressed. It is not surprising that numerous simulation packages have emerged in this field and that two of them, NAMD [63] and GROMACS [97], have become the de-facto standard. These codes use standardized force fields, i.e. potential models and parametrizations for molecules, such as CHARMM [13], GROMOS [85], OPLS [46] or AMBER [15], and provide complete tool suites for preprocessing, e.g. initialization of proteins in solution and energy minimization of initial configurations, and post-processing, e.g. analysis of trajectories or computation of statistical quantities. Also the development of special-purpose hardware for biological MD simulation, such as MDGRAPE [94] at the RIKEN research institute in Japan or ANTON [89] by the D. E. Shaw group, both designed to run protein simulations, proves the high level of standardization. Apart from molecular biology, progress has also been achieved in other fields such as solid state physics or materials science, and is witnessed by several Gordon Bell prizes, e.g. for the study of solidification processes of metals [93].

In process engineering, the situation is different, and molecular simulation is just about to evolve as a methodology. Here, MD simulation has severely suffered from a lack of both standardization and correct models as well as from the low quality of reference data [38]. Simulations still require an engineering-type approach, where simulation methods, molecular models and force fields have to be chosen with care. This choice requires experience and a deep understanding of both the scientific problem and the simulation technique. Often different models can be applied for the simulation of a phenomenon, but may lead to different results. Consider the seemingly trivial task of simulating water: here, one out of more than 120 molecular models [51], each featuring special characteristics, has to be chosen. In many cases, algorithms for simulation have to be developed beforehand, e.g. how to determine the contact angle of a droplet residing on a surface. As discussed later in detail, meaningful simulation scenarios require large particle numbers, often exceeding those in molecular biology by orders of magnitude. In these scenarios, heterogeneous particle distributions occur, necessitating efficient load balancing. On the other hand, often “simpler” models can be used in chemical engineering, resulting in cheaper computations if implemented efficiently. Consequently, the requirements on a code for MD simulation in process engineering are different, and algorithms need to be rethought and reimplemented for efficient application in this field.

1.1 The Art of Molecular Modeling and Simulation

Computational molecular engineering (CME) is a novel discipline of simulation-based engineering and high-performance computing, aiming at adapting molecular force field methods, which were developed within the soft matter physics and thermodynamics communities [3], to the needs of industrial users in chemical and process engineering.⁴

We witness today the progress from molecular simulation as a theoretical and rather academic method to CME as an arsenal of robust tools intended for practical use, which e.g. supplements or replaces experiments that are hazardous or hard to conduct [96]. This follows the general pattern by which engineering increasingly drives scientific development in areas originating from pure chemistry and physics, building on substantial basic research efforts, as soon as they have become ripe for technical application.

The degree of sophistication of molecular force field methods and the complexity of the simulated systems vary considerably between the various fields of application. In particular, the interdependence of elementary thermodynamic properties such as pressure, density, temperature, enthalpy, and composition can be reliably obtained by simulating homogeneous systems that contain up to 1,000 molecules [3]. With relatively little additional effort, higher-order derivatives of the free energy (e.g. heat capacities or the speed of sound) are accessible as well [57]; the case is similar for mechanical properties of solid materials [80]. By Grand Equilibrium or Gibbs ensemble simulation, vapor-liquid equilibria between homogeneous bulk phases, i.e. without an interface between them, can be efficiently and accurately sampled [17, 96]. Systems where a phase boundary is explicitly present can also be treated. Such simulations require more molecules, so that finite-size effects can be isolated [10], and longer computations (i.e., with more simulation steps) need to be carried out, since fluid interfaces often relax more slowly than the homogeneous bulk fluid and exhibit more significant fluctuations, e.g. capillary waves, on a long time scale.

⁴ This section is based on M. Horsch, C. Niethammer, J. Vrabec, H. Hasse: Computational molecular engineering as an emerging technology in process engineering, Information Technology 55 (2013) 97–101. It represents joint work of the mentioned authors.

This facilitates a modeling approach that has turned out to be particularly fruitful in recent years: the electrostatic features of a molecular model, i.e. the choice of parameters for point charges, dipoles or quadrupoles, are determined from quantum chemical calculations. United-atom sites interacting by the Lennard-Jones potential are employed for intermolecular repulsion and dispersive London forces [3], also known as van der Waals forces. The corresponding potential parameters are adjusted to optimize the overall agreement with experimental data [25]. These models are simple and account for the most important types of molecular interactions separately, including hydrogen bonding [36]. Furthermore, they describe the microscopic structure of the fluid (local concentrations, radial distribution functions, etc.) in a self-consistent way. This distinguishes them from other approaches for describing fluid properties and explains the fact that such models yield reliable extrapolations with respect to two different aspects: first, to conditions far beyond those where the experimental data for the parameter fit were determined; second, to a wide variety of fluid properties which were not considered during the parametrization at all [23].

Furthermore, transferable pair potentials are available which directly map functional groups to the model parameters of corresponding single-atom or united-atom interaction sites [58]. In this way, molecular simulation can deploy its predictive power, on the basis of a physically sound modeling approach, even where the available set of experimental data reaches its limits.

Both MC and MD simulation are suitable for determining most thermophysical properties: MC simulation evaluates an ensemble average by stochastically generating a representative set of configurations, i.e. position and momentum coordinates of the molecules. Thereby, MC simulation uses the Metropolis algorithm (which is randomized), whereas MD simulation computes a trajectory segment by integrating Newton's equations of motion (which are deterministic). If the same force field is used, temporal and ensemble averaging lead to consistent results, since thermodynamically relevant systems are at least quasi-ergodic [3]. MC simulation neither relies on time nor requires an explicit computation of momentum coordinates, which is advantageous for simulating adsorption and phase equilibria [96]; in these and similar cases, the most effective methods involve grand-canonical or quasi-grand-canonical ensembles with a varying number of molecules [10], where MD simulation has the disadvantage that momentum coordinates have to be determined for molecules that are inserted into the system. For more complex properties, however, e.g. regarding non-equilibrium states and the associated relaxation processes, time-dependent phenomena become essential, so that MD is the preferred simulation approach (cf. Fig. 1).


Fig. 1. Top: MD simulation snapshot for Couette shear flow of methane in a graphite nanopore [43]. Bottom: Entrance effects, adsorption/desorption kinetics, and permeability of fluid methane in nanoporous carbon, employing non-equilibrium MD simulation [43]. The simulations were conducted with ls1 mardyn.

Where pure component models are available, the extension to mixtures is straightforward. Mixing rules are available for predicting the unlike interaction parameters. If suitable experimental data are available, adjustable binary parameters can be employed to improve mixture models. This concept can also be applied to modeling fluid-wall interactions, cf. Fig. 2.

Fig. 2. MD simulation snapshot (left) and average fluid density contour plot (right) for a sessile argon droplet on a solid substrate. The simulation was conducted with ls1 mardyn.

Scientifically and technically, all preconditions for the introduction of molecular simulation in an industrial environment are now fulfilled [36]. Organizational aspects relevant for this process include institutional support, the active interest and involvement of both corporate and academic partners, and channeling of the effort to a few simulation codes, at most, rather than reinventing the wheel again and again. In this respect, the development in Great Britain can serve as a positive example, where a community centered around the Collaborative Computational Project 5 develops and applies the DL_POLY program. An example for successful collaboration between academia and industry can be found in the United States, where the Industrial Fluid Properties Simulation Challenge also attracts international attention and participation [23]. However, the corresponding programming efforts are highly fragmented: parallel developments are attempted based on the Amber, CHARMM, LAMMPS, NAMD and MCCCS Towhee codes, among many others [58, 69, 70].

At present, the German CME community constitutes the best environment for mastering the qualitative transition of molecular simulation from a scholarly academic occupation to a key technology in industrial-scale fluid process engineering. Its institutional structure guarantees an orientation towards industrial use and successfully integrates engineering with high-performance computing. It is within this framework that a consistent toolkit encompassing two major components is developed: the ms2 program (molecular simulation 2), intended for thermophysical properties of bulk fluids, and ls1 mardyn (large systems 1: molecular dynamics) for large and heterogeneous systems.

From a computational point of view, large MC or MD simulations are easier to tackle than MD simulations of processes over a relatively long time span. By far the largest part of the numerical effort is required for evaluating the force field, a task which can be efficiently distributed over multiple processes, as discussed above. In contrast, the temporal evolution along a trajectory through phase space cannot be parallelized due to its inherently sequential nature. In the past, this has repeatedly led developers of highly performant simulation codes to boast of the number of (low-density gas) molecules that they succeeded in loading into memory as the single considered benchmark criterion [30].

However, large and interesting systems generally also require more simulation time. Industrial users will hardly care how many trillions of molecules can be simulated for a dilute homogeneous gas over a few picoseconds, or even less. From the point of view of thermodynamics and fluid process engineering, the criterion for the world record in molecular simulation should not be the number of molecules $N$, but rather an exponent $a$ such that e.g. within a single day, at least $N = 10^{3a}$ molecules in a condensed state were simulated over at least $10^{a+4}$ time steps (for $a = 4$, for instance, this would mean $10^{12}$ molecules over $10^{8}$ time steps). This would promote a proportional increase of the accessible length and time scales, which is what real applications require.

By pushing this frontier forward, a wide spectrum of novel scale-bridging simulation approaches will become feasible, paving the way to a rigorous investigation of many size-dependent effects, which on the microscale may be qualitatively different from the nanoscale. Following this route, major breakthroughs will be reached within the coming decade, assuming that a research focus is placed on processes at interfaces. By focusing on such applications, cf. Fig. 1, an increase in the accessible length and time scales due to massively parallel high-performance computing will lead to particularly significant improvements, opening up both scientifically and technically highly interesting fields such as microfluidics (including turbulent flow), coupled heat and mass transfer, and the design of functional surfaces to investigation on the molecular level.

1.2 Focus and Structure of this Work

The statement “complexity trumps hardware” still holds, meaning that algorithm development has contributed at least as much to progress as the development of hardware. A good example are algorithms for long-range electrostatic interactions, which had long been considered a problem with inherently quadratic runtime, a complexity that is still prohibitive for most real-world applications today. Only the development of algorithms with $O(N \log N)$ complexity allowed MD to become standard, e.g. in molecular biology. To achieve the greatest possible impact, the best algorithms have to run on the best hardware. Both the implementations of algorithms and the algorithms themselves have to be adapted to the employed hardware. This is especially true with respect to current processors, considering the increased parallelism on the instruction level and the task level. While in former times each new hardware generation meant higher clock frequencies and speed-up came for free, current architectures rely on multiple architectural improvements, e.g. in the instruction set, and may even feature lower clock frequencies than previous generations. Here, software also needs to evolve in order to keep pace with hardware development.

Consequently, the focus of this work is on the efficient implementation of efficient algorithms in MD simulation and their adaptation to current hardware, to achieve the best possible performance. Special emphasis is put on the linked-cells algorithm [42] for short-range intermolecular interactions, because it is the core algorithm of many MD implementations. In particular, its memory-efficient and vectorized implementation on current systems is presented. Due to the peculiarities of programming models and interfaces for vectorization, these do not smoothly fit into existing software. Therefore, their software-technical integration plays an important role in this work. In terms of hardware, high-performance implementations on two platforms, namely the Intel® Xeon® E5 processor (codenamed Sandy Bridge) and the Intel® Xeon Phi™ coprocessor, are presented. The Xeon processor is widespread in contemporary HPC systems, while the latter coprocessor can be considered an intermediate product of the ongoing convergence of full-blown processors and accelerators. These implementations are not evaluated with artificial demonstrator codes, but at the example of the above-mentioned code ls1 mardyn, focusing on large heterogeneous molecular systems.

The main contribution of this book is a high-performance state-of-the-art implementation for MD simulations. This implementation enabled the world's largest MD simulation on SuperMUC, the supercomputer operated by the Leibniz Supercomputing Centre in Munich, in 2013, thereby defining the state of the art in the field of molecular simulation.

A considerable part of the work presented here has been developed in two complementary Ph.D. theses [18, 37] at the Chair for Scientific Computing in Informatics at the Technische Universität München. Although various parts of this text are based on these theses, this book provides a unified and complete description of our efforts and even provides insights beyond the results covered in [18, 37].

Structure This book covers aspects of MD simulations only as far as they are relevant to process engineering. A concise description of the basic MD algorithms is contained in Sec. 2. Based on that description, the differences between MD simulation in chemical engineering and other fields of application are carved out, and the development of a specialized code is motivated. This motivation is followed by a brief description of the code ls1 mardyn, targeting chemical engineering. Here, the focus is on the structure of ls1 mardyn as it was found at the beginning of this work, and we describe the changes that were made to obtain an efficient implementation and a maintainable and extensible software layout.

Sec. 3 gives details on the target platforms and describes the parallel implementation of MD simulation on these systems, making use of shared- and distributed-memory parallelization, including an efficient load-balancing scheme.

Sec. 4 describes the efficient implementation of the compute kernel for the Intel Xeon E5 processor and the Intel Xeon Phi coprocessor. The sliding window traversal of the linked-cells data structure forms the groundwork for the following memory- and runtime-efficient implementations. It has been a prerequisite from the software engineering point of view, and its description is followed by the implementation details of the compute kernels.

The final section describes extensive benchmarks of the described implementations. In addition to the optimized production version of the simulation code ls1 mardyn, a hybrid parallel version simultaneously making use of the Intel Xeon Phi coprocessor and the Intel Xeon-based host system is evaluated, as well as a version specialized for atomic fluids, which is evaluated on up to 146,016 cores of the SuperMUC cluster.


2 Molecular Dynamics Simulation

In this section, we give a compact description of the basics of MD simulation and cover only topics required to understand MD simulation in process engineering, i.e. in particular molecular modeling, the computation of potentials and forces, as well as the efficient identification of neighboring molecules. This description helps to elaborate the differences between MD in process engineering and other fields, thereby focusing on algorithms, and motivates the development of a specialized code. Such a code is ls1 mardyn, which is described at the end of this section.

2.1 Molecular Models and Potentials

The development of molecular models is a non-trivial task. Models have to capture the typical behavior of fluids and the geometric shape of a molecule to allow for meaningful simulations. At the same time, models should be as simple as possible to be computationally efficient. In this section, we discuss the design space for molecular models, especially from the point of view of algorithms and implementation. After a description of the numerical system to be solved by time integration, the potential types relevant to ls1 mardyn are introduced.

Fig. 3. Principle of coarse graining: a fully atomistic (left) and united-atom (right) model for butane. Atoms of functional groups are combined into a single united site, achieving a compromise between computational tractability and microscopic detail.

Molecular Models For computer simulation, a model for a molecule, e.g. a polymer as displayed in Fig. 3, is a basic prerequisite. Depending on the required level of detail, this can be done in numerous ways. In the simplest case from a modeling point of view, each atom is represented as an individual particle in the model. Often, it is not an individual atom such as a single H atom that determines the behavior of a molecule, but rather a group of atoms, e.g. a CH₂ group. Thus it is common to combine a group of atoms into one interaction site, which is then called a united-atom model. For some purposes it is possible to abstract even further and to unite several atom groups in one interaction site. It is important to decide whether the positions and orientations of these groups relative to each other are fixed, i.e. whether the molecule is rigid or not. This has a dramatic influence on algorithms and implementations and also relates to the time span which can be simulated, as motion takes place on different time scales. Intra-molecular vibrations such as those between C-H atoms are very fast and require very small time steps, while rotational motion is an order of magnitude slower, only slightly faster than translational motion. Consequently, vibrational degrees of freedom reduce the possible simulation time significantly. The coarser such a model is, the computationally cheaper it is, enabling larger or longer simulations. More complex molecular models necessitate more work for parametrization; on the other hand, their transferability may be higher. Thus, the decision for a type of model is a trade-off between development effort, transferability and computational efficiency.

Fig. 4(a) shows a molecular model with two sites, which are fixed relative to each other, while the model in Fig. 4(b) features internal degrees of freedom, i.e. the bond length is flexible. The interaction between two interaction sites $i$ and $j$, separated by a distance $r_{ij}$, can be described by a potential function $U_{ij}(r_{ij})$, which depends on the type of the interaction sites. For flexible molecules, interaction sites interact with all other interaction sites, including those of the same molecule. This interaction leads to a force on site $i$:

$$F_i = \sum_j -\nabla U_{ij}(r_{ij}).$$

To observe the time evolution of such a system of particles, where each particle has mass $m_i$, a system of ordinary differential equations has to be solved:

$$F_i = m_i \cdot \ddot{r}_i. \tag{1}$$

One way to keep a molecule or parts thereof rigid is to compute forces for each interaction site separately, and to impose geometric constraints on bond lengths, bond or torsion angles. These constraints have to be fulfilled by the algorithm for the time integration. The most common algorithm is the Shake algorithm [83], which is based on the Störmer-Verlet integration scheme; more sophisticated variants such as QSHAKE [28] or P-LINCS [40] have also been developed. However, a more efficient way is to compute the torque on molecule $i$, resulting from the interactions of its interaction sites $n \in \text{sites}_i$, and to integrate the rotational motion. In this model, only forces between sites of different molecules are computed, and the total force on a rigid molecule equals

$$F_i = \sum_{\substack{j \in \text{particles} \\ j \neq i}} \; \sum_{n \in \text{sites}_i} \; \sum_{m \in \text{sites}_j} -\nabla U_{nm}(r_{nm}).$$

This force is used to solve Eq. (1). The forces on the sites at distance $d_n$ from the center of mass at $r_i$ yield a torque on the molecule

$$\tau_i = \sum_{n \in \text{sites}_i} d_n \times F_n.$$


Fig. 4. Rigid and flexible model of a simple molecule. (a) Rigid model of a molecule: the positions of the interaction sites 1 and 2 are fixed relative to the center of mass; only sites of distinct molecules interact pairwise (dashed lines). The position of all sites is uniquely determined by the position and orientation of the molecule. (b) Model of a molecule with internal degrees of freedom: the interaction sites 1 and 2 are not fixed, but interact through bond potentials (dotted lines). The positions of the sites have to be stored explicitly with each site.


Then the system of equations for the rotational motion can be solved:

$$\dot{\omega}_i = \frac{\tau_i}{I_i},$$

where $\dot{\omega}$ is the angular acceleration, $\tau$ the torque and $I$ the moment of inertia. Here, we remark that the computation of forces on a molecule involves all other molecules in the simulation, so the complexity is $O(N^2)$ for both rigid and flexible molecular models.
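As a concrete illustration of these sums, the following C++ sketch accumulates force and torque for rigid molecules by iterating over all site pairs of distinct molecules. It is not taken from ls1 mardyn; the Vec3/Molecule data layout, the unit-parameter Lennard-Jones site force and the omission of a cut-off are simplifying assumptions.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

static Vec3 sub(const Vec3& a, const Vec3& b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 add(const Vec3& a, const Vec3& b)   { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Illustrative site-site force: Lennard-Jones-12-6 with unit parameters
// (epsilon = sigma = 1); returns the force acting on the site at position rn.
static Vec3 pairForce(const Vec3& rn, const Vec3& rm) {
    const Vec3 d = sub(rn, rm);
    const double r2 = d.x * d.x + d.y * d.y + d.z * d.z;
    const double inv2 = 1.0 / r2, inv6 = inv2 * inv2 * inv2;
    const double s = 24.0 * inv6 * (2.0 * inv6 - 1.0) * inv2;  // |F| / r
    return {s * d.x, s * d.y, s * d.z};
}

struct Molecule {
    Vec3 r;                    // center of mass
    std::vector<Vec3> sites;   // absolute site positions
    Vec3 F;                    // accumulated force
    Vec3 tau;                  // accumulated torque
};

// F_i   = sum_{j != i} sum_{n in sites_i} sum_{m in sites_j} -grad U_nm(r_nm)
// tau_i = sum_{n in sites_i} d_n x F_n,  with lever arm d_n = r_n - r_i
void computeForcesAndTorques(std::vector<Molecule>& mols) {
    for (std::size_t i = 0; i < mols.size(); ++i) {
        mols[i].F = Vec3{0.0, 0.0, 0.0};
        mols[i].tau = Vec3{0.0, 0.0, 0.0};
        for (std::size_t j = 0; j < mols.size(); ++j) {
            if (i == j) continue;  // only sites of distinct molecules interact
            for (const Vec3& rn : mols[i].sites) {
                for (const Vec3& rm : mols[j].sites) {
                    const Vec3 f = pairForce(rn, rm);  // force on site n
                    mols[i].F   = add(mols[i].F, f);
                    mols[i].tau = add(mols[i].tau, cross(sub(rn, mols[i].r), f));
                }
            }
        }
    }
}
```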

Many of the fluids targeted in process engineering are composed of comparably simple, small molecules (in the following, we use the term particle interchangeably), which can be approximated by rigid bodies. A rigid model enables a cheaper implementation of the force computation as well as longer time steps. Since our code is based on rigid-body motion, we describe rigid-body molecular dynamics in more detail. Most other current software packages implement rigid-body MD via constrained motion dynamics, which is less efficient, so this is a key aspect distinguishing ls1 mardyn from other simulation codes.

Rigid-Body Molecular Dynamics For molecules modeled as fully rigid units, both equations for translational and rotational motion can be solved at once, if force and torque on the center of mass are known. In ls1 mardyn, the Rotational Leapfrog algorithm [27] is implemented.

While the orientation of a body can be expressed in Eulerian angles, it is more convenient to use a quaternion $q = (q_0, q_1, q_2, q_3)^T$, because singularities in the equations of motion are avoided [52]. The computation of the angular acceleration is carried out in a body-fixed coordinate system, i.e. the coordinate system is fixed relative to the rotating molecule. This body-fixed coordinate system should be chosen such that the mass tensor $I$ is a diagonal matrix, simplifying the following equations. From the quaternion, a rotation matrix $R(q)$ can be defined to express a vector, given in the global coordinate system, in the body-fixed system. The inverse operation is denoted by $R^T(q)$.

The rate of change of the angular momentum $j$ equals the torque $\tau$, $\frac{\partial j}{\partial t} = \tau$, and the angular velocity $\omega$ is related to the angular momentum by $\omega = I^{-1} j$. Writing the angular velocity as $\hat{\omega} = [0; \omega]^T$, the rate of change of the orientation can be expressed as

$$\frac{\partial q}{\partial t} = Q \hat{\omega}, \quad \text{where } Q = \begin{pmatrix} q_0 & -q_1 & -q_2 & -q_3 \\ q_1 & q_0 & -q_3 & q_2 \\ q_2 & q_3 & q_0 & -q_1 \\ q_3 & -q_2 & q_1 & q_0 \end{pmatrix}.$$

Similar to the Leapfrog scheme for the translational motion, the angular momentum $j$ is stored at half time steps $n - \frac{1}{2}$, and the orientations at full time steps $n$. Starting at time $n - \frac{1}{2}$, the angular momentum $j^{\,n-\frac{1}{2}}$ is propagated to time $n$:

$$j^{\,n} = j^{\,n-\frac{1}{2}} + \frac{1}{2} \Delta t \cdot \tau.$$


It is then rotated to the body-fixed coordinate system:

$$\hat{j}^{\,n} = R^T(q^n)\, j^{\,n},$$

and the angular velocity in the body-fixed coordinate frame can be determined component-wise:

$$\hat{\omega}^n_\alpha = I_\alpha^{-1}\, \hat{j}^{\,n}_\alpha.$$

The orientation is integrated a half time step, where $q^{n+\frac{1}{2}} = q^n + \frac{\Delta t}{2}\, Q(q^n)\, \hat{\omega}^n$. The remaining steps read [27]:

$$j^{\,n+\frac{1}{2}} = j^{\,n-\frac{1}{2}} + \Delta t\, \tau^n,$$
$$\hat{j}^{\,n+\frac{1}{2}} = R^T(q^{n+\frac{1}{2}})\, j^{\,n+\frac{1}{2}},$$
$$\hat{\omega}^{n+\frac{1}{2}}_\alpha = I_\alpha^{-1}\, \hat{j}^{\,n+\frac{1}{2}}_\alpha,$$
$$q^{n+1} = q^n + \Delta t\, Q(q^{n+\frac{1}{2}})\, \hat{\omega}^{n+\frac{1}{2}}.$$

In the course of these computations, angular velocity and momentum are computed at the full time step and can be used to apply a thermostat.
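The update sequence above can be transcribed almost line by line into code. The following C++ sketch is a schematic illustration, not the ls1 mardyn implementation; the scalar-first quaternion convention, the diagonal moment-of-inertia tensor, and the final renormalization of the quaternion (added for numerical robustness, not part of the listed equations) are assumptions.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Quat = std::array<double, 4>;  // (q0, q1, q2, q3), scalar first

// Per-molecule state for the rotational leapfrog scheme (illustrative layout).
struct RigidRotor {
    Quat q;   // orientation at full time step n
    Vec3 j;   // angular momentum at half time step n - 1/2 (global frame)
    Vec3 I;   // principal moments of inertia (diagonal mass tensor assumed)
};

// Rotate a global-frame vector into the body-fixed frame; in the notation of
// the text this corresponds to applying R^T(q).
static Vec3 toBodyFixed(const Quat& q, const Vec3& v) {
    const double q0 = q[0], q1 = q[1], q2 = q[2], q3 = q[3];
    return {
        (1 - 2*(q2*q2 + q3*q3))*v[0] + 2*(q1*q2 + q0*q3)*v[1] + 2*(q1*q3 - q0*q2)*v[2],
        2*(q1*q2 - q0*q3)*v[0] + (1 - 2*(q1*q1 + q3*q3))*v[1] + 2*(q2*q3 + q0*q1)*v[2],
        2*(q1*q3 + q0*q2)*v[0] + 2*(q2*q3 - q0*q1)*v[1] + (1 - 2*(q1*q1 + q2*q2))*v[2]
    };
}

// dq = Q(q) * [0; w], with Q(q) as defined in the text.
static Quat qDot(const Quat& q, const Vec3& w) {
    return {-q[1]*w[0] - q[2]*w[1] - q[3]*w[2],
             q[0]*w[0] - q[3]*w[1] + q[2]*w[2],
             q[3]*w[0] + q[0]*w[1] - q[1]*w[2],
            -q[2]*w[0] + q[1]*w[1] + q[0]*w[2]};
}

static void normalize(Quat& q) {
    const double n = std::sqrt(q[0]*q[0] + q[1]*q[1] + q[2]*q[2] + q[3]*q[3]);
    for (double& c : q) c /= n;
}

// One rotational leapfrog step, following the update equations listed above.
void rotationalLeapfrogStep(RigidRotor& m, const Vec3& tau, double dt) {
    // j^n = j^{n-1/2} + dt/2 * tau ; rotate to body frame, obtain omega^n
    Vec3 jn = {m.j[0] + 0.5*dt*tau[0], m.j[1] + 0.5*dt*tau[1], m.j[2] + 0.5*dt*tau[2]};
    Vec3 jb = toBodyFixed(m.q, jn);
    Vec3 wn = {jb[0]/m.I[0], jb[1]/m.I[1], jb[2]/m.I[2]};

    // q^{n+1/2} = q^n + dt/2 * Q(q^n) * omega^n
    Quat dq = qDot(m.q, wn);
    Quat qh = {m.q[0] + 0.5*dt*dq[0], m.q[1] + 0.5*dt*dq[1],
               m.q[2] + 0.5*dt*dq[2], m.q[3] + 0.5*dt*dq[3]};
    normalize(qh);

    // j^{n+1/2} = j^{n-1/2} + dt * tau^n ; body-fixed omega^{n+1/2}
    for (int a = 0; a < 3; ++a) m.j[a] += dt * tau[a];
    Vec3 jbh = toBodyFixed(qh, m.j);
    Vec3 wh = {jbh[0]/m.I[0], jbh[1]/m.I[1], jbh[2]/m.I[2]};

    // q^{n+1} = q^n + dt * Q(q^{n+1/2}) * omega^{n+1/2}
    Quat dqh = qDot(qh, wh);
    for (int a = 0; a < 4; ++a) m.q[a] += dt * dqh[a];
    normalize(m.q);
}
```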

Intermolecular Potentials Theoretically, particle interaction needs to be modeled by many-body potentials, which take the interactions between $n-1$ particles into account when determining the potential energy for the $n$-th particle. As the construction of such potential functions is a highly non-trivial task, interaction models are simplified to two- or three-body potentials, where the contributions of all pairs or triples of particles are assumed to be strictly additive. Choosing the “right” potential functions, this results in much lower computational cost while sufficient accuracy is maintained. In ls1 mardyn, the following effective pair potentials are used [64]:

Lennard-Jones-12-6 Potential. This potential models van der Waals attraction and Pauli repulsion and describes uncharged atoms:

$$U(r_{ij}) = 4\varepsilon \left( \left( \frac{\sigma}{r_{ij}} \right)^{12} - \left( \frac{\sigma}{r_{ij}} \right)^{6} \right). \tag{2}$$

Consequently, this potential reproduces the properties of noble gases very well, and it is used both for the study of ideal fluids and as a building block for complex molecular models. The potential parameters $\varepsilon$ and $\sigma$ are valid only for interaction sites of the same species. For interactions of two unlike species A and B, their values can be determined by the modified Lorentz combination rule [56]

$$\sigma_{AB} = \eta_{AB}\, \frac{\sigma_A + \sigma_B}{2}, \quad 0.95 < \eta_{AB} < 1.05,$$


and the modified Berthelot mixing rule [9, 86]:

$$\varepsilon_{AB} = \xi_{AB} \left( \varepsilon_A \varepsilon_B \right)^{\frac{1}{2}}, \quad 0.95 < \xi_{AB} < 1.05,$$

where $\eta_{AB}$ and $\xi_{AB}$ are empirically determined mixing coefficients. The potential can be truncated at a cut-off distance $r_c$, assuming a homogeneous particle distribution beyond $r_c$. This truncated potential, referred to as the Truncated-Shifted Lennard-Jones-12-6 potential, allows the construction of efficient algorithms with linear runtime $O(N)$. The error of the potential truncation can be estimated by a mean-field approximation to correct the computed quantities. For rigid-body molecules the cut-off is applied based on their centers of mass.
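For reference, a scalar (non-vectorized) implementation of this pair interaction and the combination rules might look as follows. This is a sketch in reduced units; the function names and the simple shift at $r_c$ (without the mean-field long-range correction) are illustrative assumptions, not the kernels of ls1 mardyn discussed in Sec. 4.

```cpp
#include <cmath>

// Lennard-Jones-12-6 parameters of one species (reduced units assumed).
struct LJParams { double epsilon, sigma; };

// Modified Lorentz-Berthelot combination rules for unlike species A and B;
// eta and xi are the empirical binary mixing coefficients (close to 1).
LJParams combine(const LJParams& A, const LJParams& B, double eta = 1.0, double xi = 1.0) {
    return { xi * std::sqrt(A.epsilon * B.epsilon),
             eta * 0.5 * (A.sigma + B.sigma) };
}

// Truncated-shifted LJ-12-6: potential energy u and the scalar factor f such
// that the force on site i is F_i = f * (r_i - r_j).  Returns false beyond r_c.
bool ljTruncatedShifted(const LJParams& p, double r2, double rc, double& u, double& f) {
    if (r2 >= rc * rc) return false;                           // outside cut-off sphere
    const double s2  = p.sigma * p.sigma;
    const double lj6 = (s2 / r2) * (s2 / r2) * (s2 / r2);      // (sigma/r)^6
    const double lj12 = lj6 * lj6;                             // (sigma/r)^12
    const double rc6 = (s2 / (rc * rc)) * (s2 / (rc * rc)) * (s2 / (rc * rc));
    u = 4.0 * p.epsilon * (lj12 - lj6) - 4.0 * p.epsilon * (rc6 * rc6 - rc6); // shifted
    f = 24.0 * p.epsilon * (2.0 * lj12 - lj6) / r2;            // |F| / r
    return true;
}
```

In a direct-summation or linked-cells pair loop, the returned factor $f$ is multiplied by the distance vector $r_i - r_j$ and added to the force on site $i$ (and subtracted from site $j$ by Newton's third law).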

Electrostatic Potentials. Another basic interaction type are Coulomb interactions:

$$U_{qq}(r_{ij}) = \frac{1}{4\pi\varepsilon_0}\, \frac{q_i q_j}{r_{ij}}, \tag{3}$$

where $1/(4\pi\varepsilon_0)$ is the Coulomb constant, $q_i$ and $q_j$ are the interacting charges, and $r_{ij}$ is the distance between the charges. Charge distributions with zero net charge may be approximated by higher-order point polarities, i.e. dipoles and quadrupoles, as described in [32].

If the net charge of the molecules equals zero, these potentials can also be truncated at a cut-off distance $r_c$. The effect of the truncation on the potential energy can be estimated by the Reaction-Field method [3, 6].
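A minimal sketch of Eq. (3) as code, assuming reduced units in which the Coulomb prefactor can be set to one (an assumption for illustration, not a convention prescribed by ls1 mardyn):

```cpp
// Coulomb pair potential U_qq = k * q_i * q_j / r_ij, cf. Eq. (3),
// with k = 1/(4*pi*eps0); here k defaults to 1 (reduced units assumed).
double coulombPotential(double qi, double qj, double rij, double k = 1.0) {
    return k * qi * qj / rij;
}

// Corresponding force factor f, so that the force on charge i is
// F_i = f * (r_i - r_j).
double coulombForceFactor(double qi, double qj, double rij, double k = 1.0) {
    return k * qi * qj / (rij * rij * rij);
}
```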

2.2 Statistical Ensembles

The computation of macroscopic values from microscopic quantities is the field of statistical mechanics. In the following, we explain the basics as far as necessary to understand the implementation in ls1 mardyn and refer to [3] for more details.

The current state of a rigid-body MD simulation can be fully described by the number of particles, their positions, velocities, orientations and angular momenta. From such a configuration, macroscopic quantities such as temperature or pressure can be computed. Many molecular configurations exist which map to the same macroscopic values, i.e. these configurations cannot be distinguished on the macroscopic level. That set of configurations forms a so-called ensemble. In order to characterize an ensemble, it is sufficient to fix three thermodynamic variables, e.g. the number of particles N, the volume V, and the total energy E (NVE). All other thermodynamic variables may fluctuate, and their values can be determined through averaging over samples of configurations. Other common ensembles fix the temperature T (NVT), the pressure P (NPT) or the chemical potential µ (µVT). In the thermodynamic limit, i.e. for infinite system size, these different statistical ensembles are equivalent for homogeneous systems, and basic thermodynamic properties can be computed as averages:


– The total energy $E$ is computed as the sum of the ensemble averages of the potential energy $U_{\text{pot}}$ and the kinetic energy $E_{\text{kin}}$:

$$E = \langle U_{\text{pot}} \rangle + \langle E_{\text{kin}} \rangle = \left\langle \sum_i \sum_{j>i} U(r_{ij}) \right\rangle + \left\langle \sum_i \frac{1}{2} m_i v_i^2 \right\rangle.$$

– Following the virial theorem [16], the temperature $T$ can be computed as

$$T = \left\langle \frac{1}{3 N_f k_B} \sum_{i=1}^{N} v_i^2 m_i \right\rangle.$$

Here, $N_f$ denotes the number of molecular degrees of freedom in the simulation, and $k_B$ the Boltzmann constant.

– The pressure $P$ can be split into an ideal part and a configurational or virial part and computed as

$$P = \langle P^{\text{ideal}} \rangle + \langle P^{\text{conf}} \rangle = \langle \rho k_B T \rangle - \left\langle \frac{1}{3V} \sum_i \sum_{j>i} r_{ij} \cdot f_{ij} \right\rangle,$$

where $r_{ij}$ denotes the distance and $f_{ij}$ the force between interaction sites $i$ and $j$ (see the sketch below).
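On-the-fly evaluation of these quantities reduces to simple accumulations over the particle data; time averaging is then applied to the instantaneous values. The following is a minimal sketch, assuming reduced units ($k_B = 1$) and that the pair loop already accumulates the virial sum alongside the forces:

```cpp
#include <cstddef>
#include <vector>

struct ParticleState { double m; double v[3]; };

// Instantaneous temperature T = (1 / (3 N_f k_B)) * sum_i m_i v_i^2,
// here in reduced units with k_B = 1 (assumption).
double temperature(const std::vector<ParticleState>& p, std::size_t nDegreesOfFreedom) {
    double sum = 0.0;
    for (const ParticleState& s : p)
        sum += s.m * (s.v[0] * s.v[0] + s.v[1] * s.v[1] + s.v[2] * s.v[2]);
    return sum / (3.0 * static_cast<double>(nDegreesOfFreedom));
}

// Instantaneous pressure following the sign convention used above:
// P = rho * k_B * T - virialSum / (3 V), where the pair loop accumulates
// virialSum = sum_i sum_{j>i} r_ij . f_ij during the force computation.
double pressure(double rho, double T, double virialSum, double volume) {
    return rho * T - virialSum / (3.0 * volume);
}
```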

Simulations in the NVE ensemble are the most self-evident. Energy is kept constant automatically, as solving the Newtonian equations of motion conserves energy and momentum. To exclude boundary effects and to minimize finite-size effects, periodic boundary conditions are typically imposed on simulations. If particles leave the domain through one boundary, they re-enter the domain via the opposite boundary, so neither the number of particles nor the volume changes.

For NVT simulations, a thermostat is needed to keep the system at constant temperature. Conceptually, this is achieved by coupling the simulated system to an external heat bath, so that a weak exchange takes place without disturbing the system under consideration. While several algorithms have been proposed [44] and especially relaxation schemes are popular, a very simple and effective method is to scale all particle velocities by a factor

$$\beta = \sqrt{\frac{T^{\text{target}}}{T^{\text{current}}}}.$$

Velocity scaling does not strictly preserve the NVT ensemble; however, it is often used under the assumption that the simulated system is not severely disturbed by the thermostat [101]. To conserve all these ensembles, modified time integration schemes have been developed, which solve the equations of motion in a suitable way. In ls1 mardyn, the thermostated version of the Rotational Leapfrog algorithm [27] is used for NVT simulations.
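A velocity-scaling thermostat is correspondingly short; the sketch below assumes the current temperature has already been evaluated (e.g. with a helper like the one shown after the ensemble averages above):

```cpp
#include <cmath>
#include <vector>

struct Particle { double v[3]; };

// Simple velocity-scaling thermostat: rescale all velocities by
// beta = sqrt(T_target / T_current), as described above.
void applyVelocityScaling(std::vector<Particle>& particles,
                          double targetT, double currentT) {
    const double beta = std::sqrt(targetT / currentT);
    for (Particle& p : particles)
        for (double& c : p.v)
            c *= beta;
}
```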


2.3 Algorithms for Nearest-Neighbor Search

As already noted in the explanation of the truncated-shifted Lennard-Jones potential, it is often possible to truncate potentials at a cut-off radius $r_c$, such that potential and force on a particle depend only on its local neighborhood. The efficient identification of that neighborhood can bring the runtime complexity down from $O(N^2)$ to $O(N)$, allowing for an asymptotically optimal algorithm. In the following, we discuss algorithms commonly implemented in MD codes.

Direct Summation. This is the simplest implementation of neighbor search. The distances between all particle pairs are computed, and interactions are evaluated only for those pairs separated by less than $r_c$. While the quadratic runtime complexity is retained, this is the most efficient algorithm for small particle sets, as it does not incur overhead for organizing the particles. Additionally, it can easily be vectorized and parallelized, and well-known optimizations such as cache blocking can be applied, such that the implementation becomes truly compute bound and achieves a high fraction of peak performance. Direct summation became especially popular for implementations on GPGPUs, due to its simplicity and inherently high parallelism, which fits well to the architecture and programming model.

Verlet Neighbor Lists [99]. Another frequently used approach are Verlet neighbor lists, shown in Fig. 5. For each particle, pointers to the molecules within a “skin” radius $r_c + \Delta r$ are stored in a list. In order to find all neighboring particles within distance $r_c$, only the particles in the list have to be checked. Depending on the movement of the particles and the value of $\Delta r$, the neighbor list has to be updated every $n$ time steps, to make sure it contains all neighboring molecules. In principle, this update is of $O(N^2)$ complexity, as again the mutual distances between all particles have to be computed, so modern implementations combine it with the linked-cells algorithm, explained in the next paragraph, to achieve linear runtime.

Fig. 5. Schematic of the Verlet neighbor list.
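A brute-force rebuild of such lists can be sketched as follows; the O(N²) candidate search shown here is for illustration only, since production codes generate the candidates via the linked-cells structure:

```cpp
#include <cstddef>
#include <vector>

struct Pos { double x, y, z; };

// Rebuild Verlet neighbor lists with skin radius dr: for each particle, store
// the indices of all particles closer than r_c + dr.
std::vector<std::vector<std::size_t>>
buildVerletLists(const std::vector<Pos>& p, double rc, double dr) {
    const double skin2 = (rc + dr) * (rc + dr);
    std::vector<std::vector<std::size_t>> lists(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            if (dx * dx + dy * dy + dz * dz < skin2)
                lists[i].push_back(j);   // store each pair once (j > i)
        }
    }
    return lists;
}
```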


It is an obvious advantage of neighbor lists that they approximate the geometry of the cut-off sphere well, so only few unnecessary distance computations have to be performed. On the other hand, the complexity of the implementation is slightly higher, as both the linked-cells and the neighbor-list method have to be implemented. Verlet lists are most efficient in static scenarios, where the movement of particles between time steps is very slow. This holds, e.g., for simulations at very low temperatures, with small time steps, or generally in the simulation of solid bodies such as crystals. The overhead of a large number of pointers per molecule, possibly even up to a few hundred, has to be considered as well. While this is usually not a serious issue on current computers, there is a clear trade-off between memory and computational overhead. A more severe question is the runtime-efficient implementation of neighbor lists on current hardware. Pointers do not preserve locality, as required for vectorization. In contrast to earlier vector computers, today's vector architectures do not yet support gather/scatter operations efficiently, which would facilitate the implementation. Apart from that, the multiple memory accesses when traversing the neighbor list can seriously degrade performance on current architectures [72].

Linked-Cells Algorithm [42, 88]. As depicted in Fig. 6(a), the computational domain is subdivided into cells of edge length $r_c$. In every time step, the particles are sorted into these cells according to their spatial coordinates. In order to identify all particles within the cut-off radius around a given particle, only the 8 neighboring cells in 2D or 26 cells in 3D, as well as the cell of the particle itself, have to be searched. Assuming a homogeneous particle distribution, each cell contains $N/c$ particles, where $c$ denotes the number of cells, so the distance computation can be done in $O(N)$, as long as the particle density is kept constant. As we will see later, this algorithm is inherently cache-friendly, as for each cell $(N/c)^2$ computations are performed.

Fig. 6. Standard linked-cells algorithm and generalized linked-cells algorithm: (a) schematic of the original linked-cells idea with edge length $l = r_c$; (b) schematic of the generalized linked-cells algorithm with edge length $l = \frac{r_c}{2}$.
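The core of the method is the cell assignment and the traversal of the 27-cell neighborhood. The following C++ sketch is illustrative, assuming a cubic domain $[0, L)^3$ and ignoring halo cells and periodic boundaries for brevity; it is not the sliding-window implementation described in Sec. 4:

```cpp
#include <cstddef>
#include <vector>

struct Pos { double x, y, z; };

// Sort particle indices into cells of edge length >= rc on a cubic domain [0, L)^3.
std::vector<std::vector<std::size_t>>
buildCells(const std::vector<Pos>& p, double L, double rc, int& n) {
    n = static_cast<int>(L / rc);          // cells per dimension
    const double cellLen = L / n;          // actual edge length (>= rc)
    std::vector<std::vector<std::size_t>> cells(static_cast<std::size_t>(n) * n * n);
    for (std::size_t i = 0; i < p.size(); ++i) {
        const int cx = static_cast<int>(p[i].x / cellLen);
        const int cy = static_cast<int>(p[i].y / cellLen);
        const int cz = static_cast<int>(p[i].z / cellLen);
        cells[(static_cast<std::size_t>(cz) * n + cy) * n + cx].push_back(i);
    }
    return cells;
}

// Visit every particle pair within the 3x3x3 cell neighborhood of cell (cx,cy,cz);
// pairs farther apart than rc are rejected by the explicit distance check.
template <class PairVisitor>
void forNeighborPairs(const std::vector<std::vector<std::size_t>>& cells,
                      const std::vector<Pos>& p, int n, double rc,
                      int cx, int cy, int cz, PairVisitor visit) {
    const std::vector<std::size_t>& own =
        cells[(static_cast<std::size_t>(cz) * n + cy) * n + cx];
    for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                const int nx = cx + dx, ny = cy + dy, nz = cz + dz;
                if (nx < 0 || ny < 0 || nz < 0 || nx >= n || ny >= n || nz >= n)
                    continue;              // no halo cells in this sketch
                const std::vector<std::size_t>& other =
                    cells[(static_cast<std::size_t>(nz) * n + ny) * n + nx];
                for (std::size_t i : own)
                    for (std::size_t j : other) {
                        if (i >= j) continue;  // count each pair exactly once
                        const double ddx = p[i].x - p[j].x;
                        const double ddy = p[i].y - p[j].y;
                        const double ddz = p[i].z - p[j].z;
                        if (ddx * ddx + ddy * ddy + ddz * ddz < rc * rc)
                            visit(i, j);
                    }
            }
}
```

Calling forNeighborPairs for every cell as the center cell visits each interacting pair exactly once, because the i >= j check discards the duplicate that would arise when the roles of the two cells are swapped.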


Since its invention, a lot of work has gone into the optimization of the linked-cells algorithm. As is evident from simple geometric considerations, roughly 78 % of the particle distance computations are actually wasted, because the particles are separated by more than $r_c$. One refinement is the generalized linked-cells algorithm, which chooses cells of smaller size to better approximate the geometry of the cut-off sphere, cf. Fig. 6(b). Thereby, the volume that has to be searched for neighboring particles is decreased, as is the number of distance computations [8]. The efficiency of such schemes has also been investigated in [95], amongst others. Buchholz [14] extended this idea by choosing the size of the cells adaptively, depending on the density of the fluid: for regions with high number density, smaller cells are chosen to decrease the overhead of distance computations; for regions with low number density, larger cells are chosen to avoid the overhead associated with many small cells. That scheme is called the adaptive linked-cells algorithm. A different optimization technique is interaction sorting [31], where for each cell pair the coordinates of the particles are projected onto the vector connecting the two cell centers. Then the particles are sorted according to their position on that vector. In that way, the distance between particles needs to be computed only as long as the distance along the vector of the cell centers is smaller than $r_c$, greatly reducing the number of superfluous computations. A summary and comparison of the different approaches can be found in [100].

In comparison to direct summation, some overhead occurs due to the cell data structure and its update. An advantage is the seamless integration of periodic boundary conditions, as depicted in Fig. 7. The cell structure is extended by one cell layer at each boundary. In every time step, the particles of the opposite boundary are replicated in these so-called halo cells. These particles are used during the force computation and deleted thereafter.

Fig. 7. Implementation of periodic boundary conditions: after each time step, particles in the boundary cells are copied into the halo cells on the other side in the same coordinate direction.
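A minimal sketch of such a halo copy for one coordinate direction, assuming a cubic box of edge length L and a halo width of one cut-off radius (the actual code works cell-wise on whole boundary cells rather than on individual particles):

```cpp
#include <vector>

struct Pos { double x, y, z; };

// Create periodic halo copies for the x direction (y and z are analogous):
// particles within one cut-off radius of a boundary are replicated on the
// opposite side, shifted by +/- L, so that the force loop can treat them like
// local particles; the copies are deleted after the force computation.
std::vector<Pos> haloCopiesX(const std::vector<Pos>& particles, double L, double rc) {
    std::vector<Pos> halo;
    for (const Pos& p : particles) {
        if (p.x < rc)     halo.push_back({p.x + L, p.y, p.z});
        if (p.x > L - rc) halo.push_back({p.x - L, p.y, p.z});
    }
    return halo;
}
```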


Linked-Cells: Parallelization Based on Spatial Domain Decomposition. In a similar way, parallelization based on spatial domain decomposition fits especially well to the linked-cells data structure. Here, the computational domain is subdivided according to the number of processes, so that equally sized sub-domains are assigned to each process. Similar to the integration of periodic boundaries, each sub-domain is extended by one cell layer, which contains molecules residing on neighboring processes and which has to be communicated every iteration. Depending on the implementation, forces for each molecule pair crossing a process boundary have to be communicated, or they are computed redundantly on each process. Spatial domain decomposition can be efficiently combined with a load-balancing scheme; its implementation will be the topic of Sec. 3.

2.4 Characteristics of Large-Scale Molecular Dynamics Simulation of Fluids

A number of well-known programs for molecular simulation exist, such as NAMD [69], GROMACS [41], Desmond [11], Charmm [13], Amber [84], Espresso [5] or Lammps [70]. Since most of them have their background in molecular biology, these codes can in principle be used for simulations in process engineering. However, the application of these tools may not be straightforward, may require odd workflows and may lack computational efficiency, rendering them non-optimal. In the following, we discuss properties of MD simulations in biology and in chemical engineering. On the basis of the preceding sections, we outline algorithmic differences and contrast the requirements on implementations for each field of application.

MD Simulation in Molecular Biology. A typical use case of MD simulation in biology is the study of protein folding, where the probability of conformations is determined. Such simulations deal with only few but large macromolecules to observe conformational changes. Since the results are investigated in more detail in lab experiments, MD is used to dramatically narrow the search space for those experiments.

In explicit solvent-type simulations, macromolecules float in a large homogeneous bulk of water or aqueous salt solution, which is the natural environment of proteins. In order to mitigate finite-size effects (e.g. the interaction of a protein with its own periodic image), a sufficiently large simulation box filled with water has to be simulated [15]. Yet the total number of molecules is comparably small, typically in the order of 1,000–10,000. These simulations have to take place at the atomic level, slightly increasing the number of bodies to be dealt with, e.g. by a factor of three in the case of TIP3P water. Due to intra-molecular vibrational motions, small time steps have to be chosen. Simulation parameters are standardized to a high degree, e.g. computer experiments are run at ambient temperature, employing TIP3P, TIP4P, or SPC/E water models [41, 69]. Every atom may participate in a number of interactions of different types, e.g. non-bonded electrostatic or Lennard-Jones interactions and bonded interactions. These characteristics strongly influence established simulation codes.


Consequences for Biomolecular Simulation Codes. Algorithms and their implementations are chosen to match these properties as well as possible. For several reasons, Verlet neighbor lists are the method of choice for neighbor search in all the aforementioned simulation packages. Typical scenarios do not exhibit strong dynamics such as flows, and comparably small time steps have to be applied due to the internal degrees of freedom. Therefore, the movement of atoms between two time steps is limited, which is favorable for Verlet lists. Moreover, bonded atoms have to be excluded from non-bonded interactions. This can be accomplished with exclusion lists, which in turn integrate nicely with neighbor lists. In contrast, computation with the linked-cells algorithm might require multiple force evaluations [34, p. 203]. Due to the high level of standardization, force fields can be supplied in the form of libraries. For commonly used solvents such as TIP3P or TIP4P, GROMACS offers specially tuned interaction kernels boosting performance. The consideration of internal degrees of freedom results in a higher arithmetic intensity per atom: first of all, neighbor search is based on atoms instead of molecules, increasing the computational complexity by a constant factor, e.g. nine in the case of a three-site water model. In addition, constraint-motion algorithms such as Rattle, Shake or P-LINCS have to be applied, which are computationally more expensive and impair scalability.

Due to the presence of ions, it is necessary to treat long-range Coulomb interactions with appropriate methods. As the number of molecules is rather small, Ewald summation techniques are implemented as standard methods, and FFT-accelerated Ewald techniques seem to be optimal. Because of the comparably low particle count, it is common to write trajectory files for each time step of the simulation and to investigate quantities of interest in a post-processing step.

MD Simulation in Chemical Engineering. In chemical engineering, MD simulations are used, e.g., to predict thermodynamic properties of mixtures of fluids, and these predictions have to match real data quantitatively with high precision [24].

While the simulation of a bulk of solvent is an unwanted necessity in biological applications, it is now the main purpose, and interest focuses on the computation of macroscopic properties such as transport coefficients or nucleation rates. To reduce statistical uncertainties and finite-size effects, large numbers of molecules, up to several millions, are required. Applications cover a wide range of thermodynamic states, e.g. very high or low pressures and temperatures. Often, molecular force fields have to be developed to correctly reproduce properties in these ranges [61]. For many applications, it is sufficient to model fluids composed of comparably simple, i.e. rigid, molecules without internal degrees of freedom. Finally, applications such as phase transitions or processes at the liquid-vapor interface are characterized by strongly heterogeneous particle distributions.

Consequences for Simulation Codes in Engineering. Rigid molecular models simplify both the computation of intermolecular interactions and the solution of the equations of motion. The cut-off condition is not evaluated per atom, but for a whole molecule based on its center of mass, reducing the number of distance computations. Rigid molecules are uniquely assigned to a process, which reduces the complexity of an efficient parallelization. For molecules with internal degrees of freedom, atoms may reside on different processes, which requires additional communication and synchronization. Due to the rigidity, larger time steps are practical, allowing for longer simulation times in the end.

The number of molecules required for a meaningful simulation in chemical engineering can be larger by orders of magnitude, which has tremendous effects. While Verlet neighbor lists may still be usable, the linked-cells algorithm is a better choice, as the memory overhead for storing pointers is avoided. Moreover, the simulation of flows or nucleation exhibits higher dynamics of the molecules, so neighbor lists would need to be rebuilt frequently, especially in combination with larger time steps. Due to the particle number, it is advisable to compute statistical data on the fly, instead of storing the particles' trajectories and running tools for post-analysis. While the latter is feasible for a scenario with e.g. 10,000 molecules, I/O becomes a bottleneck for large-scale simulations with millions of particles. Post-processing tools would need to be parallelized to handle large amounts of data efficiently, so their implementations are of similar complexity as the actual simulation code.

Particle distributions cause severe load imbalances. The density of a liquid differs from that of a gas roughly by two orders of magnitude. As the computational effort scales quadratically with the density, liquid phases are about 10,000 times as compute-intensive as gas phases. For some processes such as nucleation, the distribution of particles evolves dynamically in an unpredictable manner, so an efficient dynamic load-balancing scheme is needed [14]. Dealing with heterogeneities can be supported by the choice of spatially adaptive algorithms, such as the adaptive linked-cells algorithm.

Concluding, the requirements on codes for simulation in engineering differ from those for codes in biology. While the well-established codes for simulation in biology or chemistry can in principle be used for the simulation of processes in chemical engineering, the characteristics of such simulations are quite different. In order to allow for high usability and to boost computational efficiency, codes specifically tailored to their field of application are essential.

2.5 Simulation Code Mardyn

Tackling the field of process engineering, the simulation code ls1 mardyn [1, 64] has now been developed for about a decade. Main contributors have been the groups at the High Performance Computing Center Stuttgart (HLRS), the Chair for Thermodynamics and Energy Technology (ThET) at the University of Paderborn, the Laboratory for Engineering Thermodynamics (LTD) at the University of Kaiserslautern, and the Chair for Scientific Computing in Computer Science (SCCS) at Technische Universitat Munchen.

The development has been inspired by ms2 [17], a mature Fortran code for the molecular simulation of thermodynamic properties. Supporting both classical MD and MC simulation with rich functionality, ms2 focuses on small molecular systems, so the investigation of nucleation or flow processes is hardly possible.


Also the investigation of competing domain-specific codes such as Towhee (http://towhee.sourceforge.net/) or GIBBS (http://www.materialsdesign.com/medea/medea-gibbs) confirmed that codes for large-scale MD simulation in the engineering sciences were rather limited [35]. Therefore, the work on a modern C++ code for large-scale parallel MD simulation was started.

In the current software layout, two design principles dominate. The first is Separation of Concerns. Modular design is a key requirement for several reasons. First of all, academic partners from different disciplines develop the code at different geographic locations. Optimally, modifications or additions of features affect only small parts of the code, facilitating distributed software development. Furthermore, technical aspects should be separated from application-specific aspects: e.g., a chemical engineer implementing the computation of new statistical quantities should not need to understand details of the parallelization. In academic software development, developers change rather frequently, so modular design makes it easier to focus on specific aspects of MD simulation without the requirement to understand all parts of the software. Finally, modularity fosters the exchange of algorithms and their implementations.

The second principle is one code base for sequential and parallel execution, rather than having two distinct codes. Targeting parallel simulations, software development is simplified if code can be developed, executed, and to a certain extent also tested sequentially. Maintenance of a single code base is less error-prone than keeping two similar codes synchronized. Alternatively, sequential code could be interleaved with pre-compiler directives, which can make code harder to understand. Therefore, parts directly related to parallelization are hidden behind interfaces, and application-specific classes are implemented and tested independently of the type of parallelization.

Software Structure. Although the above two design principles have not been strictly realized, they are heavily reflected in the software design. A UML diagram containing the main components of ls1 mardyn is shown in Fig. 8. The class Simulation is the central component, so all other classes are arranged around it. The main components and their relations are discussed in the following.

Fig. 8. Software layout of ls1 mardyn: the packages for parallelization, particleContainer, io, integration and the molecular model are centered around the main class Simulation.

Package parallel. This package comprises everything related to parallelization based on the domain decomposition scheme explained in Sec. 3. Its interface is defined by DomainDecompBase, which is responsible for particle exchange and inter-process communication. DummyDomainDecomposition provides an implementation of this interface for sequential execution. DomainDecomposition is the standard domain decomposition method for MPI, and KDDecomposition is an MPI implementation providing load balancing based on KD-trees. In the case of sequential compilation, the latter two implementations are excluded.
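To illustrate this interface separation, the following minimal C++ sketch declares a decomposition interface of this kind. Apart from the class names and balanceAndExchangeMolecules (which appears in Lst. 1.1 below), the method names and signatures are assumptions for illustration, not the actual ls1 mardyn API.

class ParticleContainer;  // defined in package particleContainer

// Interface of the parallel package: particle exchange and inter-process communication.
class DomainDecompBase {
public:
    virtual ~DomainDecompBase() {}
    // Exchange halo/boundary particles with neighbouring processes (assumed name).
    virtual void exchangeMolecules(ParticleContainer* container) = 0;
    // Rebalance the decomposition if necessary and exchange particles (name as in Lst. 1.1).
    virtual void balanceAndExchangeMolecules(ParticleContainer* container) = 0;
};

// Sequential build: there is nothing to exchange, so all methods are empty.
class DummyDomainDecomposition : public DomainDecompBase {
public:
    void exchangeMolecules(ParticleContainer*) override {}
    void balanceAndExchangeMolecules(ParticleContainer*) override {}
};

// DomainDecomposition (regular MPI decomposition) and KDDecomposition (KD-tree based
// load balancing) implement the same interface and are excluded from sequential builds.

Application code only sees DomainDecompBase, so switching between the sequential and the MPI variants does not affect the rest of the simulation.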

Package io. This package provides two interfaces for file input and output. The method readPhaseSpace of InputBase reads the phase space, i.e. the definition of molecule types together with positions, orientations, and velocities of molecules. An OutputBase writes files containing, e.g., visualization or restart data.

Package particleContainer. This package contains the data structures for molecule storage and traversal. The main characteristic of a ParticleContainer is that molecule pairs can be traversed according to the cut-off radius. Initially, two implementations existed, for the standard linked-cells algorithm and for its adaptive version. Both organize particles with the help of ParticleCells. During the traversal of particle pairs, a ParticlePairsHandler is called for each pair with a distance smaller than the cut-off radius rc. Implementations of that interface compute interactions (ParticlePairs2PotforceAdapter) or determine the associated computational load (ParticlePairs2LoadCalcAdapter) in the context of load balancing.
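As a rough sketch (the signature is an assumption; only the class names and the cut-off criterion are taken from the description above), the pair handler interface that separates the traversal from the computation could look as follows:

class Molecule;

// Called by a ParticleContainer for every molecule pair within the cut-off radius rc.
class ParticlePairsHandler {
public:
    virtual ~ParticlePairsHandler() {}
    // dist2 is the squared pair distance, already computed during the traversal.
    virtual void processPair(Molecule& m1, Molecule& m2, double dist2) = 0;
};

// Computes forces and potential contributions for each pair.
class ParticlePairs2PotforceAdapter : public ParticlePairsHandler { /* ... */ };

// Only accumulates the computational load, used for load balancing.
class ParticlePairs2LoadCalcAdapter : public ParticlePairsHandler { /* ... */ };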

Package integrator. Though an interface is provided, only the Leapfrog integrator is supported at the moment. Implementing the Leapfrog Rotational Algorithm, it solves the molecules' equations of motion in every time step.

Classes Domain and Ensemble. These classes are designed to contain all application-specific aspects such as the evaluation of thermodynamic properties or the enforcement of the correct statistical ensemble. Consequently, the computation of energies, pressure, profiles, and long-range corrections is found here.

Package molecules. The implementation of the molecular model is based on the Flyweight design pattern [29]. Usually there is a large number of molecules of the same type in a simulation. A type or Component describes the number of LJSites or electrostatic interaction sites and the respective potential parameters. Each Site stores its position relative to the molecule, so its absolute global position has to be computed from the position and orientation of the molecule. For each component in the simulation, exactly one object is created and referenced by all molecules of the same type.
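A minimal sketch of the Flyweight idea follows; the field names are assumptions for illustration, only the roles of Component, Site, and Molecule are taken from the description above.

#include <array>
#include <vector>

struct LJSite {
    std::array<double, 3> relativePosition;  // position relative to the molecular centre
    double epsilon, sigma;                   // LJ-12-6 parameters of this site
};

struct Component {                           // shared type object: one per molecule type
    std::vector<LJSite> ljSites;             // electrostatic sites would be stored analogously
};

struct Molecule {
    const Component* component;              // all molecules of a type reference the same object
    std::array<double, 3> position;          // centre of mass
    std::array<double, 4> orientation;       // quaternion; global site positions are derived
                                             // from position and orientation on demand
};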

Class Simulation. This class is the heart of ls1 mardyn and ties together the different parts of the code. It is responsible for setting up a simulation run and executing the simulation loop, see the pseudo-code in Lst. 1.1.

After the initialization, i.e. after reading the phase space and creating a domain decomposition with an initial particle exchange, the main loop is executed. First the integrator performs a half-step time integration to promote the velocities to the full time step, then halo particles are exchanged and load balancing may take place. Then forces and potential are calculated, and the thermostat as well as the computations of other thermodynamic statistical quantities are applied. At the end of the loop, the integrator performs the second half-step integration, and i/o is performed.


Listing 1.1. Pseudo-code of the main simulation loop in the class Simulation.

inputReader->readPhaseSpace();
domainDecomposition->balanceAndExchangeMolecules();

for (i < numberOfTimesteps) {
    integrator->eventNewTimestep();
    domainDecomposition->balanceAndExchangeMolecules();
    container->traverseParticlePairs(pairs2ForceAdapter);
    thermostat();
    integrator->eventForcesCalculated();
    for (k < numberOfOutputPlugins) {
        outputPlugin[k]->doOutput();
    }
}


3 Parallelization of MD Algorithms and Load Balancing

Due to the enormous computational requirements of MD simulations, efficient parallelization techniques are also required. This chapter describes the efficient parallel implementation of ls1 mardyn for shared-memory and distributed-memory architectures. Before diving into the algorithmic details, we start in the next subsection, Sec. 3.1, with highlighting similarities and discussing differences between common recent supercomputer building blocks. This subsection lays the basis for the current and the next section and allows us to re-engineer MD applications in such a way that they can leverage the features of all these platforms by abstracting the different hardware concepts to their governing design principles: heterogeneity, massive amounts of cores, and data parallelism.

Following that description, we describe a highly scalable shared-memory parallelization, mainly targeted at the Intel Xeon Phi coprocessor. Using shared memory, redundant force computations and the additional computation already mentioned in the preceding section can be avoided, which is a core requirement for achieving good strong scalability. In current software packages, domain decomposition [70] is most commonly implemented, especially for parallelization with MPI. It is also used in ls1 mardyn and strongly influenced its design, so it is described in the following. One challenge with domain decomposition is how to decompose the computational domain to ensure that each process has an equal load. ls1 mardyn's solution to this issue is the topic of Sec. 3.4.

3.1 Target Systems

In the late 70s the Intel 8086 was introduced to the market, representing the first processor of the so-called x86 architecture. Today, more than 85% of the systems listed in the TOP500 list [62], which ranks the world's most powerful computers, are based on the x86 architecture. Additionally, basically every desktop or notebook computer, as well as smaller cluster installations at universities, relies on x86 as its main computing engine. In the following, we will have a closer look at the Intel CPUs often used in today's computers. We cover the Intel Xeon E5 processor (based on the Sandy Bridge microarchitecture), released in 2011 for servers, and the Intel Xeon Phi coprocessor, which has been broadly available since early 2013.

Intel Xeon E5 The Sandy Bridge microarchitecture implements a major refresh: it features several changes for increasing single-thread performance and the new Advanced Vector Extensions (AVX) vector instruction set. With AVX, the width of the SIMD vector registers was doubled, which in theory leads to a two times higher peak performance than its predecessor. Sandy Bridge's server version features up to eight cores and 20 MB of level 3 cache, which makes it a perfect basis for a powerful supercomputer, see Fig. 9. These processors are called the Intel Xeon E5 series. The most significant changes from a platform perspective are the increased memory bandwidth (102.4 GB/s) and the two times higher inter-socket bandwidth (16 GT/s) compared to Nehalem. These enhancements aim at increasing the scalability of the platform.


Fig. 9. Schematic overview of the Intel enterprise platform. It consists of two sockets mounting one Xeon E5 processor each. This figure shows the maximum possible configuration with two eight-core chips.

In addition, the Intel Xeon E5 processor comprises a ring interconnect capable of scaling to link up to 20 cores on a single die through a shared L3 cache. A significant portion of the electrical design of previous rings has been reused; however, much of the higher-layer cache coherency protocol has been re-designed. The interconnect consists of at least four rings: a 32-byte data link and separate rings for requests, acknowledgments, and snooping. The rings are overlaid on the design of the Last Level Cache (LLC). This cache is subdivided into separate units, one for each x86 core. Data requires one clock cycle to progress by one agent on the ring. This fast interconnect allows more efficient data sharing between cores, providing the throughput required for the processor's 256-bit floating-point vector units (AVX units).

In order to execute AVX code with high performance and to increase the core's instructions per clock (IPC), major changes to the previous core microarchitecture have been applied. These changes are highlighted in orange in Fig. 10. Since the SIMD vector instruction width has been doubled with AVX, the load port's (port 2) width would consequently need to be doubled as well. However, doubling a load port's width would impose tremendous changes on the entire chip architecture. In order to avoid that, Intel changed two ports by additionally implementing in each port the other port's functionality, as shown for ports 2 and 3. Through this trick, the load bandwidth has been doubled and the VPU's demand can be better saturated. The changes to the ALUs are straightforward: ports 0, 1 and 5 are doubled while providing classic SSE functionality for AVX instructions and extensions for mask operations.


Fig. 10. The core block-diagram of the Intel Xeon E5 processor. In comparison to the previous generation, a cache for decoded instructions has been added, ports 2 and 3 have been enhanced, and a physical register file has been added for better performance when executing AVX code.

However, this bandwidth improvement still does not allow for a perfect leverage of AVX instructions using all 256 bits (AVX256), as this would require a load bandwidth of 64 bytes per cycle and a store bandwidth of 32 bytes per cycle. As a result, AVX applications which rely on streaming from the L1 cache (such as small vector operations) are limited to half of the core's peak performance. However, compute-bound kernels such as matrix multiplication can fully exploit the AVX instruction set. The required L1 bandwidth increase has been implemented with the Haswell microarchitecture (which is not under investigation in this work). Due to the 32-byte load bandwidth and the non-destructive AVX128 instruction set, AVX128 code can often yield the same performance as AVX256 on the Sandy Bridge microarchitecture, but performs much better than SSE4.2 on equally clocked chips without AVX support. This can also be attributed to the fact that 16-byte load instructions have a three times higher throughput (0.33 cycles) than 32-byte load instructions (for 32-byte loads, ports 2 and 3 have to be paired and cannot be used independently).

Up to the Xeon E5 processor, each unit featured dedicated memory for storing register contents in order to execute operations on them. This solution has many advantages (see [39] for details) but requires a lot of chip space, i.e. transistors. With AVX, such a register allocation would be too expensive in terms of the transistors required, therefore a so-called register file (see [39]) has been implemented. Register contents are stored in a central directory; shadow registers and pointers allow for an efficient pipeline implementation. In the case of the Xeon E5, 144 256-bit SIMD registers are used to support the 16 ymm registers (AVX registers) visible at the Instruction Set Architecture (ISA) level. Furthermore, a general performance enhancement has been added: a cache for decoded instructions. This trace-cache-like cache, see [81], boosts kernels with small loop bodies, e.g. from linear algebra, and allows applications to obtain up to 90% or even more of the provided peak performance.

We want to close the description of the Xeon E5 processor by describing SuperMUC (http://www.lrz.de/services/compute/supermuc/), which is operated at the Leibniz Supercomputing Centre in Munich. This system features 147,456 cores and is at present one of the biggest pure Xeon E5 systems worldwide, employing two Xeon E5-2680 Sandy Bridge processors per node, with a theoretical double-precision peak performance of more than 3 PFLOPS; it was ranked #9 on the June 2013 TOP500 list. The system was assembled by IBM and features a highly efficient hot-water cooling solution. In contrast to supercomputers offered by Cray, SGI, or IBM's own BlueGene, the machine is based on a high-performance commodity network: an FDR-10 Infiniband pruned tree topology by Mellanox. Each of the 18 leafs, or islands, consists of 512 nodes with 16 cores each at 2.7 GHz clock speed, sharing 32 GB of main memory. Within one island, all nodes can communicate at full FDR-10 data rate. In the case of inter-island communication, four nodes share one uplink to the spine switch. Since the machine is operated diskless, a significant fraction of the nodes' memory has to be reserved for the operating environment.

Intel Xeon Phi coprocessor The Intel Xeon Phi is a coprocessor mounted on a PCIe expansion card, shown in Fig. 11. It is a many-core architecture whose cores are based on the first Pentium-generation processor. These 20-year-old cores have been enhanced by the standard x86 64-bit instruction set and combined with a powerful VPU featuring 512-bit wide SIMD vectors, thereby doubling the width of AVX. The first commercially available version of the Xeon Phi coprocessor, code-named Knights Corner, has up to 61 cores and is the first silicon implementing Intel's Many Integrated Core (MIC) architecture. Especially MD simulations can potentially benefit from these massively parallel coprocessor devices.

Each core may execute up to four hardware threads with round-robin scheduling between the instruction streams, i.e. in each cycle the next instruction stream is selected. Xeon Phi uses the typical cache structure of per-core L1 (32 KB) and L2 (512 KB) caches. The shared L2 cache, with a total of 30.5 MB (61 cores), uses a high-bandwidth ring bus for fast on-chip communication. An L3 cache does not exist due to the high-bandwidth GDDR5 memory (352 GB/s at 2570 MHz). Since Xeon Phi follows the key principles of an x86 platform, all caches and the coprocessor memory are fully coherent. On the card itself, a full Linux operating system is running.

Therefore, code can either be run using a GPU-style offload layout, or it can be executed directly on the card. Since the card offers Remote Direct Memory Access (RDMA), applications can directly exchange data with peripherals such as network cards. Therefore, a compute cluster equipped with Intel Xeon Phi coprocessors can be used in various ways; it is not limited to using the coprocessor as an add-in device that is just suitable for offloading highly parallel sub-parts of applications to.

Fig. 11. Sketch of an Intel Xeon Phi coprocessor with 60 cores which are connected by a fast ring bus network. Each core has 4-way SMT and features a 512-bit-wide vector unit.

Figure 12 sketches the possible usage models of an Intel Xeon Phi cluster. Such a cluster offers five different usage models:

CPU-hosted: a classic CPU cluster without any coprocessor usage.
offload to MIC: highly parallel kernels are offloaded to the Intel Xeon Phi coprocessor.
symmetric: MPI tasks are started on both devices (CPU and MIC), and the communication between the host and the coprocessor is transparent due to the message passing programming paradigm; however, sophisticated load balancing may be required.
reverse offload to CPU: if an application is in general highly parallel but suffers from small and rare sequential parts, these parts can be "offloaded" back to the CPU.
MIC-hosted: suitable for highly parallel and scaling codes; the CPUs are not used anymore as the application runs entirely on the coprocessor cards.

Fig. 12. Intel Xeon Phi usage models.

The major performance leap is accomplished through the very wide vector units. These offer an enhanced feature set even compared with AVX, which is supported by Intel Xeon E5 CPUs and above. Xeon Phi has full support for gather/scatter instructions, and every instruction can be decorated with permutations and store masks. Therefore, Intel MIC is the first x86 processor implementing a complete RISC-style SIMD vector instruction set, which allows the programmer to express richer constructs.

As it is based on x86, the Intel MIC architecture can support all programming models that are available for traditional processors. Compilers for MIC support Fortran (including Co-Array Fortran) and C/C++. OpenMP [66] and Intel Threading Building Blocks [79] may be used for parallelization, as well as emerging parallel languages such as Intel Cilk Plus or OpenCL. Furthermore, as the VPU instruction set is closely related to AVX, the Intel compiler can automatically generate MIC-native vector code whenever it already vectorizes the code for AVX.

The MIC-hosted and symmetric modes are very important as they offer programmers the ability to leverage the power of Xeon Phi without programming it as an offload coprocessor. Unfortunately, the power of the Intel Xeon Phi cannot be fully unleashed when using a cluster in these two modes. This is due to very low data transfer bandwidths between MIC cards. These are caused by the Intel Xeon E5 processor when being used as a PCIe bridge, since its internal buffers were not optimized for such a use case. Both Xeon Phi coprocessor boards, mic0 and mic1, are attached to the same Xeon E5 host processor. In all cases where one coprocessor communicates with the host, nearly the full PCIe bandwidth can be achieved. However, when communicating directly between both PCIe boards, the bandwidth is limited to 1 GB/s. If the two coprocessors are mounted at different sockets, the bandwidth even decreases to 250 MB/s, since the Quick Path Interconnect (QPI) agents of the Xeon E5 processors have to be involved. If the amount of transferred data is big enough (more than 128 KB), it is worthwhile to implement a so-called proxy application that runs on the host and handles all Infiniband transfers, cf. [45, 71]. Please note that later versions of the Xeon E5 processor, e.g. v2 or v3, fix many of these limitations.

3.2 Shared Memory Parallelization

As also demonstrated, e.g., in [37], an efficient shared-memory parallelization is essential for sufficient performance on the Intel Xeon Phi coprocessor. In the case of ls1 mardyn, due to periodic boundary conditions, a minimal number of MPI processes is even more important: for the distributed-memory parallelization, particle pairs crossing process boundaries are computed twice. Running with more than 100 ranks per Xeon Phi card creates an enormous number of boundary and halo cells and results in massive communication and computation overheads, especially in comparison with an Intel Xeon processor running, e.g., only 16 MPI ranks. Therefore we parallelized the code as far as possible with OpenMP. First we discuss the parallelization of the interaction computation, and then of the remaining major steps of the simulation.

Parallelization of the Interaction Computation. The parallelization of short-range interactions computed by the linked-cells algorithm has been investigated by Smith [91], using a replicated-data strategy. Plimpton compared different parallelization strategies in more detail [70]. Kalia [49] presented a parallelization for distributed-memory MIMD machines. Recently, Liu et al. [55] tried to derive a shared-memory implementation for multi-core processors. However, they use an all-pairs search, i.e. O(N²), and present only relative speed-ups of different implementations rather than absolute timings. The most recent research is that of Kunaseth [53], who presents a state-of-the-art hybrid OpenMP/MPI parallelization of the linked-cells algorithm in the context of the P3M method for long-range electrostatics.

For the parallelization, two basic strategies exist. In the first, each thread exclusively computes the force on a given molecule, i.e. the force between two molecules i and j is computed twice, once for each molecule. The second strategy is to employ Newton's third law: then a force has to be added to two particles, where the second particle typically belongs to a different thread. Since in the second case the force vector of the second molecule also has to be updated, the second approach achieves only about 60% higher sequential performance. For parallel execution, the concurrent access to the force array has to be resolved.

Resolution possibilities investigated, e.g., by Buchholz [14] and in [55] are the locking of complete cells, the introduction of critical sections for the force update, and array privatization. Locking cells or introducing critical sections is no solution, since many locks would be required, causing high contention over the large number of cores on the Xeon Phi. Experiments with a lock-free cell-coloring approach revealed that the number of cells per color phase is too small in realistic scenarios to offer enough parallelism for that many threads. Avoiding synchronization by separate, thread-private force arrays requires additional memory per particle and thread, i.e. memory scales as N × p, which is prohibitive for 100 threads or more. Here, the data-privatization scheme of Kunaseth [53] might offer a scalable solution, which we may evaluate in the future.

Concluding, the parallelization of ls1 mardyn's cell traversal is challenging due to the use of Newton's third law. Especially when running smaller problems, there is hardly enough parallelism to fully leverage the Xeon Phi's compute power. Therefore, we do not exploit Newton's third law in our shared-memory parallelization, which is also common practice on GPU-accelerated systems [4, 78, 92]. Each thread is assigned a number of cells, both inner and halo cells to mitigate load imbalances, and computes the interactions of the particles in these cells with all their neighbors. Thereby, the linked-cells data structure is concurrently traversed by all threads synchronization-free, requiring synchronization only at the end of the traversal to reduce global values such as the potential energy. In the upcoming performance evaluation we will compare this shared-memory parallelized version of the force calculation to ls1 mardyn's original pure MPI parallelization using more than 100 MPI ranks.
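The following self-contained OpenMP sketch illustrates this synchronization-free traversal for a single-centered Lennard-Jones fluid. The data layout (Particle, Cell, the per-cell neighbour index lists) and the helper function are simplified assumptions rather than the ls1 mardyn classes, and reduced units (epsilon = sigma = 1) are assumed.

#include <vector>

struct Particle { double x[3], f[3]; };
struct Cell { std::vector<Particle> particles; };

// Adds the LJ-12-6 force on pi only (Newton's third law deliberately not used)
// and returns the pair's potential energy contribution.
static double addLJForceOnFirst(Particle& pi, const Particle& pj, double r2) {
    double inv2 = 1.0 / r2, inv6 = inv2 * inv2 * inv2, inv12 = inv6 * inv6;
    double scale = 24.0 * (2.0 * inv12 - inv6) * inv2;  // (-dU/dr) / r
    for (int d = 0; d < 3; ++d) pi.f[d] += scale * (pi.x[d] - pj.x[d]);
    return 4.0 * (inv12 - inv6);
}

// Each thread owns a set of cells and only writes to particles of its current
// cell; neighbours[c] lists all neighbour cells of c including c itself.
double computeForces(std::vector<Cell>& cells,
                     const std::vector<std::vector<int>>& neighbours,
                     double rc2) {
    double upot = 0.0;
#pragma omp parallel for schedule(dynamic) reduction(+ : upot)
    for (int c = 0; c < (int)cells.size(); ++c) {
        for (int n : neighbours[c]) {
            for (Particle& pi : cells[c].particles) {
                for (const Particle& pj : cells[n].particles) {
                    if (&pi == &pj) continue;
                    double r2 = 0.0;
                    for (int d = 0; d < 3; ++d) {
                        double dx = pi.x[d] - pj.x[d];
                        r2 += dx * dx;
                    }
                    // Every pair is visited twice (once from each side), hence the 0.5.
                    if (r2 <= rc2) upot += 0.5 * addLJForceOnFirst(pi, pj, r2);
                }
            }
        }
    }
    return upot;
}

Because forces are only written to particles of the thread's current cell, no locks or thread-private force arrays are needed; the price is that every pair interaction is evaluated twice, once from each side.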

Parallelization of the Remaining Parts. Most work found in the literature concerning shared-memory parallelization focuses on the parallel interaction computation, which consumes by far the highest fraction of the runtime. This is reasonable for moderately parallel implementations with eight or 16 threads, but not sufficient when moving to the large thread numbers required for the Intel Xeon Phi. There, the sequential parts of the simulation dominate due to Amdahl's law and inhibit program efficiency, so these parts need to be parallelized as well.

An example to visualize this is given in Table 1. As discussed before, the MPI parallelization conceptually incurs overhead due to the halo regions, but performs rather well on the Intel Xeon Phi, since all parts of the simulation are inherently executed in parallel. Therefore, we consecutively parallelized the most time-consuming remaining parts of ls1 mardyn. The basis of this implementation is the parallelism on the cell level in the linked-cells data structure. The key points of the implementation are outlined in the following.


Simulation step                  240 MPI ranks    240 OMP threads
Force calculation                    22.91             21.07
Particle exchange                     5.99            112.54
Time integration                      0.40             45.83
Deletion of outer particles           0.96             14.01
Computation of statistics             0.30             39.97

Table 1. Comparison of the runtimes of the different algorithmic steps of the simulation on the Intel Xeon Phi. A pure MPI-based parallelization and a pure OpenMP-based version are compared, where in the OpenMP version only the interaction computation has been parallelized, i.e. the other steps are executed sequentially. Runtimes are given per 100 iterations of a simulation of 600,000 molecules of the Lennard-Jones fluid.

Linked-Cells Data Structure The linked-cells data structure is usually implemented in a way that molecules are kept in a global list, and cells store only pointers to molecules. The use of a global list prevents efficient parallelization, however. Therefore, the global list has been dissolved and ownership of the molecule objects has been transferred to the cell objects. Then all following steps can, at least conceptually, be easily parallelized over cells.

Time Integration and Statistics Time integration and the computation of statistics are embarrassingly parallel. To parallelize loops over particles, the existing sequential iterator over molecules has been complemented by a parallel iterator, which parallelizes the loop on a per-cell basis (a sketch of such a per-cell loop is shown after this list).

Update of the Linked-Cells Data Structure In order to sort particles into new cells, each thread is assigned a number of cells, and read-write access to cells assigned to other threads has to be synchronized. Here, a double-buffering technique with two lists per cell for the particle pointers is employed, similar to the GPU implementation described in [98], and the update proceeds in two steps. First, for all particles in a cell their new cell index is computed. If a particle stays in its cell, it is immediately removed from the currentParticles list and stored in the newParticles list. Otherwise, its index is stored in a separate index array. After that, all threads are synchronized and search the neighboring cells' index lists. If they find molecules with their own cell index, these molecules are copied. This scheme works well, as the index computation is relatively expensive and is efficiently parallelized, while the comparison and copy operation is rather cheap.

Particle Exchange For the particle exchange, boundary particles are first collected in one array. This is again parallelized on the cell level, and the resulting per-thread lists are reduced to one global array. The MPI send/receive operation in the case of hybrid parallelization is executed by the master thread. The particles received have to be inserted into the linked-cells data structure, similar to the update step: the array of particles is divided into equal chunks, for which each thread computes cell indices and stores them into a separate index array. Finally, each thread iterates over the whole index array, compares each index with its cells, and inserts particles if required.
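A minimal sketch of such a per-cell parallel loop follows, here applied to a leapfrog half-step velocity update; the helper name applyToAllParticles and the data layout are assumptions for illustration, not the ls1 mardyn iterator interface.

#include <vector>

struct Particle { double x[3], v[3], f[3], mass; };
struct Cell { std::vector<Particle> particles; };

// Applies a function to every particle; the outer loop over cells is the unit
// of parallelism, so no two threads ever touch the same particle.
template <class Func>
void applyToAllParticles(std::vector<Cell>& cells, Func f) {
#pragma omp parallel for schedule(static)
    for (int c = 0; c < (int)cells.size(); ++c)
        for (Particle& p : cells[c].particles)
            f(p);
}

// Usage example: half-step velocity update of the leapfrog scheme.
void halfStepVelocities(std::vector<Cell>& cells, double dt) {
    applyToAllParticles(cells, [dt](Particle& p) {
        for (int d = 0; d < 3; ++d)
            p.v[d] += 0.5 * dt * p.f[d] / p.mass;
    });
}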


Depending on the algorithmic step, fewer than 240 threads are utilized due to a lack of parallelism, e.g. when fewer boundary cells than threads exist. Still, a sufficient parallel speed-up for the full simulation is achieved to enable an efficient execution on both the Intel Xeon E5 processor and the Intel Xeon Phi coprocessor, as the evaluation will show.

3.3 Spatial Domain Decomposition

Spatial domain decomposition is used for the MPI-based implementation. It subdivides the domain into regular pieces of equal size. In Fig. 13 this is shown for four processes. First the domain is subdivided according to the number of processes along each dimension, then the linked-cells data structure is set up for each process. This order guarantees a more regular subdivision than creating a subdivision based on the cell structure imposed by the linked-cells algorithm. The sub-domain of each process is surrounded by a layer of halo cells, which actually reside on the neighbouring processes. The particles of these halo cells have to be communicated in each iteration. Here it is convenient to use a Cartesian topology in MPI, which arranges the processes in a virtual 3D torus.

It is most efficient to perform the communication along the three spatial dimensions, thereby reducing communication steps and synchronization as much as possible. This is shown in 2D in Fig. 13, where particles are first communicated along the x-axis and then along the y-axis. In this way, the particles in the black cell are communicated with two communication steps instead of three. In 3D, this pattern requires only three instead of seven communication steps.

The communication along the spatial dimensions can be done by non-blocking and overlapping MPI send/receive operations: first, all processes start asynchronous receive operations for both the left and the right neighbour along that dimension. Then they start the send operations and wait for all send and receive operations to finish. This kind of parallelisation requires only local neighbour communication and exhibits excellent scalability, as will be shown later.

Fig. 13. Spatial domain decomposition: the original domain is split into four sub-domains. The particles in the black corner cell are first sent to the right neighbour along the x-axis, then both lower processes send the particles along the y-axis to the upper neighbours.

For the computation of global statistical values such as potential energy or pressure, MPI_Allreduce() is required. As only a small number of data elements is involved, this global communication does not represent a major bottleneck on current platforms.
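The following sketch shows the communication pattern along one dimension with non-blocking MPI operations. It assumes that the particle data has already been packed into plain double buffers and that the receive buffer sizes are known (e.g. exchanged beforehand), which simplifies the actual molecule exchange performed in ls1 mardyn.

#include <mpi.h>
#include <vector>

// Exchange halo data with the left and right neighbour along one dimension of
// a Cartesian communicator, overlapping all four transfers.
void exchangeAlongDimension(int dim, MPI_Comm cartComm,
                            std::vector<double>& sendLeft, std::vector<double>& sendRight,
                            std::vector<double>& recvLeft, std::vector<double>& recvRight) {
    int left, right;
    MPI_Cart_shift(cartComm, dim, 1, &left, &right);   // neighbours along this dimension

    MPI_Request reqs[4];
    // Post the receives first, then the sends, then wait for completion.
    MPI_Irecv(recvLeft.data(),  (int)recvLeft.size(),  MPI_DOUBLE, left,  0, cartComm, &reqs[0]);
    MPI_Irecv(recvRight.data(), (int)recvRight.size(), MPI_DOUBLE, right, 1, cartComm, &reqs[1]);
    MPI_Isend(sendLeft.data(),  (int)sendLeft.size(),  MPI_DOUBLE, left,  1, cartComm, &reqs[2]);
    MPI_Isend(sendRight.data(), (int)sendRight.size(), MPI_DOUBLE, right, 0, cartComm, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}

Repeating this step for the y- and z-dimension forwards edge and corner particles transparently, which is why three communication steps suffice in 3D.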

However, two drawbacks of this scheme have to be noted. First of all, it generates computational overhead: interactions between particles in halo cells are computed twice (once per process). Alternatively, forces could be computed by one process and sent to the neighbouring ones; according to the notion that "FLOPS are free", computation is preferred over communication. A more severe limitation is that the minimal size of a sub-domain is 2x2x2 cells. For smaller sub-domains, particles would need to be communicated to the next two neighbouring processes if they migrate to the neighbouring process. An interesting alternative are the recently developed neutral territory methods [12], because they allow smaller halo regions and consequently reduce the overhead of the associated redundant computations.

3.4 Load Balancing Based on KD Trees

In heterogeneous scenarios such as nucleation, density fluctuations up to a factor of 100 may occur. As the computational effort scales quadratically with the density, sub-domains containing liquid may cause up to 10,000 times the computational effort of sub-domains containing gas. This renders a naive domain decomposition inefficient, due to the load imbalance between processes. Therefore, efficient load balancing techniques are required. In [14], Buchholz investigated four strategies based on graph partitioning, diffusion, space-filling curves, and KD-trees. While the approach based on space-filling curves also provided very good results, the KD-tree based approach fits especially well with the implementation of the linked-cells algorithm and the domain decomposition as described above and was therefore chosen for implementation in ls1 mardyn. Its principle is described in the following, using the formalism introduced in [14].

Fig. 14(a) shows a scenario with a heterogeneous particle distribution, where a regular domain decomposition for four processes leads to load imbalance. Here we consider only the number of particles as a measure for the load. Fig. 14(b) shows a decomposition based on KD-trees, where a perfect partitioning is achieved, and Fig. 14(c) illustrates the corresponding tree. This tree is constructed recursively: the root node of the tree represents the total simulation domain, and all available processes are assigned to it. The simulation domain is then split along a plane into two sub-domains, which are assigned an appropriate number of processes, so that the total computation cost and the cost incurred by the partitioning are minimized. In the example, the domain is split and process 0 is assigned to the "left" area. With this splitting, two child nodes are added to the root node, where the left one represents the leaf containing only process 0 and the right one represents the remaining part of the domain and is assigned processes 1-3. For that node, the subdivision is continued recursively.


Fig. 14. Regular domain decomposition (a) and KD-tree based decomposition (b) in comparison, with (c) the KD-tree corresponding to the decomposition shown in (b). The regular domain decomposition leads to load imbalance, which is avoided by the KD-tree based decomposition.

The total computation cost Cost_A of an area A for a given splitting is determined by two parts:

\mathrm{Cost}_A = \mathrm{AreaCost}_A + \mathrm{SepCost}_A . \qquad (4)

The first one is the computation cost AreaCost_A associated with the area itself. Let Neighbours(i) denote all neighboring cells of a cell i and N_i the number of particles in cell i. Then the computational cost of an area A is

\mathrm{AreaCost}_A = \sum_{i \in A} \Bigl( N_i^2 + \sum_{j \in \mathrm{Neighbours}(i)} \tfrac{1}{2} N_i N_j \Bigr) .

The second part of Eq. (4) is the cost caused by the separation, SepCost_A, because the interactions between cells of different sub-domains have to be computed twice, once for each neighboring process. Therefore it is beneficial to avoid divisions through areas with high density, such as droplets. The search for separation planes is depicted in Fig. 15. Let T_A denote the boundary cells of area A along a split plane and P(i) the process owning cell i; then the cost for the separation can be determined as

\mathrm{SepCost}_A = \sum_{i \in T_A} \; \sum_{\substack{j \in \mathrm{Neighbours}(i) \\ P(i) \neq P(j)}} N_i N_j .

For each candidate separation sep, the average cost per process is determined:

\mathrm{CostPerProcess} = \frac{\mathrm{Cost}_A + \mathrm{Cost}_B}{P} ,


where P is the total number of processes and B denotes the second area created by the split. From that, the number of processes assigned to each area is computed: P_A = Cost_A / CostPerProcess and P_B = P - P_A. Then the effective costs per process are computed as PC_A = Cost_A / P_A and PC_B = Cost_B / P_B. For a splitting, the separation plane sep is chosen which minimizes the maximal effective process cost:

\min_{\mathrm{sep}} \, \max\{ PC_A , PC_B \} .

Fig. 15. Visualization of different separation planes and the associated cost. Interactions between hatched cells are computed twice due to the subdivision (adapted from [14]).

This load-balancing strategy is a core ingredient of a molecular simulation program for nanofluidics.
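A compact sketch of this cost model is given below. The data layout (per-cell particle counts and neighbour lists) is an assumption for illustration, and the clamping of the process counts to at least one process per area is an implementation detail added here, not part of the formalism above.

#include <algorithm>
#include <cmath>
#include <vector>

// AreaCost: N_i^2 for interactions inside cell i plus half of the pair
// interactions with each neighbouring cell, summed over the area.
double areaCost(const std::vector<int>& cellsOfArea, const std::vector<int>& N,
                const std::vector<std::vector<int>>& neighbours) {
    double cost = 0.0;
    for (int i : cellsOfArea) {
        cost += double(N[i]) * N[i];
        for (int j : neighbours[i])
            cost += 0.5 * double(N[i]) * N[j];
    }
    return cost;
}

// SepCost: pairs across the split plane are computed twice, once on each side,
// so they are counted with full weight for the boundary cells T_A.
double sepCost(const std::vector<int>& boundaryCells, const std::vector<int>& N,
               const std::vector<std::vector<int>>& neighboursAcrossPlane) {
    double cost = 0.0;
    for (int i : boundaryCells)
        for (int j : neighboursAcrossPlane[i])
            cost += double(N[i]) * N[j];
    return cost;
}

// For one candidate split: distribute P processes proportionally to the costs
// of the two areas and return the maximal effective cost per process, which is
// the quantity minimized over all candidate separation planes.
double maxEffectiveCost(double costA, double costB, int P) {
    double perProcess = (costA + costB) / P;
    int    pA = std::max(1, std::min(P - 1, (int)std::round(costA / perProcess)));
    int    pB = P - pA;
    return std::max(costA / pA, costB / pB);
}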


4 Efficient Implementation of the Force Calculation in MD Simulations

As stated in the introduction, scientists are faced with many different types of MD simulations. This complexity is even increased by the numerous kinds of computing systems available today, and this trend will continue. At first sight it seems impossible to design a scientific application in such a way that it can be executed with high performance on all these different flavors of computers. However, we demonstrate in this chapter how the computational kernel of MD simulations can be mapped to different kinds of hardware by applying minimal changes to the software.

Since the initial description of the linked-cells algorithm, a vast number of optimizations has been investigated. These range from sequential optimizations and the efficient implementation on vector computers, cf. Sec. 4.2, to memory and cache optimizations, e.g. [59, 60]. Mostly targeting bio-molecular simulations, these optimizations have often been tested against single-centered atoms. In this chapter, we focus on rigid molecules as they have been implemented in ls1 mardyn, but we also present our implementation for atomic fluids.

Based on the preceding presentation of the target systems in Sec. 3.1, this section starts with the discussion of memory optimizations in Sec. 4.1. We describe the newly developed sliding window traversal of the linked-cells data structure, which enables the seamless integration of the new optimizations. That traversal is the foundation of Sec. 4.2, where we explain how one of the most important MD kernels, the calculation of the LJ-12-6 potential, can be efficiently mapped to modern hardware for multiple interaction centers. We close the implementation chapter by pointing out optimization opportunities for the case of having just one LJ-12-6 center, which is a common benchmark for MD applications, in Sec. 4.3.

4.1 Memory Access Optimizations

The initial software design of ls1 mardyn focused on single particle-pair interactions. This is reasonable from a modeling point of view. From an efficiency point of view it is preferable to change that focus to groups of particles, since the vector computing extensions of current microprocessors work best on long arrays, i.e. multiple particles. Such a grouping is also advantageous for a memory-efficient implementation. In order to seamlessly integrate the new concepts derived in the following, a refactoring was carried out, which replaces the particle-pair centered view by a cell-centered view. This refactoring is based on the observation that the access pattern of the cells can be described by a sliding window, which moves through the domain. This sliding-window traversal is an integral part of the algorithmic modifications introduced.

After a cell has been searched for interacting particles for the first time in a time step, its data will be required for several successive force calculations with particles in neighbouring cells. If the force calculation proceeds according to the cells' index as depicted in Fig. 16, these data accesses happen within a short time, until the interactions with all neighbours have been computed. The range of cells between the cell with the highest offset and the cell with the lowest offset can be considered as a window. While the cells in this window are accessed several times, they naturally move in and out of it. In the example shown in Fig. 16, the cells are processed row by row, first increasing the current cell's x-index, followed by the y-index. Thereby, cells are processed in FIFO order according to their index. The forces on the molecules in cell 13 are evaluated. The particles of cell 24 are considered for the first time during this round of force evaluations, whereas cell 2 will not be searched again during the current iteration. When the forces on the particles in cell 14 are calculated, cell 25 will be searched for the first time, whereas cell 3 will not be touched any more, and so on. In this way, a sliding window ranging over three cell layers in total is moved through the whole domain; inside it, the computationally most expensive actions take place.

Fig. 16. Sliding window (cells in bold black frame) in 2D. Particles in cells in the window will be accessed several times; cells 2 through 24 are covered by the window in FIFO order. For the force calculation for the molecules in cell 13, cell 24 is searched for interacting particles for the first time in this iteration. The particles in cell 2 are checked for the last time for interactions.

This sliding window traversal is implemented in the ParticleContainer classes based on the observer pattern. Initially, the particle traversal in the classes LinkedCells and AdaptiveSubCells consisted of two distinct loops. The containers store two index sets, one for the cells inside the sub-domain and one for the boundary cells. For all cells inside the domain, all forward neighbours were computed, followed by the computation of the forward and backward neighbour cells of the boundary cells. This loop structure has been changed to one loop over all cells. Each cell now stores whether it is a boundary cell and can be treated differently inside the loop. To cleanly separate the traversal of cells from the operations on cells, the interface CellProcessor was introduced, shown in Fig. 17. The methods initTraversal() and endTraversal() are called before and after the cell traversal is performed. In the beginning, the CellProcessor is passed the number of cells in the sliding window. preprocessCell() and postProcessCell() are called for each cell that enters or leaves the sliding window, and processCell() is called for each cell when it is the current cell (corresponding to cell 13 in Fig. 16). Following that call, processCellPair() is executed for all cell pairs involving the current cell.

Fig. 17. New software layout for the traversal of particle pairs: operations on particle pairs are handled cell-wise by the CellProcessor, which may delegate to the legacy interface ParticlePairsHandler.
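In code, the interface could be sketched as follows; the method names are those given above, while the exact signatures are assumptions.

class ParticleCell;

class CellProcessor {
public:
    virtual ~CellProcessor() {}
    virtual void initTraversal(int windowSize) = 0;       // number of cells in the sliding window
    virtual void preprocessCell(ParticleCell& cell) = 0;  // cell enters the window
    virtual void processCell(ParticleCell& cell) = 0;     // interactions within the current cell
    virtual void processCellPair(ParticleCell& c1,
                                 ParticleCell& c2) = 0;   // current cell with one neighbour
    virtual void postProcessCell(ParticleCell& cell) = 0; // cell leaves the window
    virtual void endTraversal() = 0;                      // e.g. reduce global values
};

// The VectorizedCellProcessor builds SoA buffers in preprocessCell(), computes the
// vectorized interactions in processCell()/processCellPair(), and writes the results
// back in postProcessCell(); the LegacyCellProcessor instead iterates over particle
// pairs and delegates to a ParticlePairsHandler.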

The previous interface of the particle containers allowed a free choice of internal data structures. The new design requires that all particle containers be implemented by means of cells. This is no severe restriction, as even Verlet neighbour lists would be implemented on top of a cell-based data structure. The implementation of an all-pairs particle container would represent a single big cell.

The sliding window traversal emphasizes the cache-friendliness of the linked-cells algorithm and allows for the memory-efficient implementation of computations on particle pairs. Additionally needed data structures, e.g. for vectorisation, now need to be allocated only for a small fraction of the particle data. This implementation provides a transparent mechanism to software developers. Based on this refactoring, the VectorizedCellProcessor was introduced, encapsulating the vectorized interaction computation. To allow for backward compatibility, a LegacyCellProcessor is provided, which implements the traversal over particle pairs as it has been done previously by the particle containers, and calls a ParticlePairsHandler.

4.2 Vectorization

The runtime-efficient implementation of MD for short-range, non-bonded interactions has long been a topic of research. This has two main reasons. First, due to the enormous demand for compute power in MD, one has to ensure that each computing element is exploited by making use of SIMD vector instructions. Second, it is a hard problem for vectorisation because of the irregular nature of data access and computation. We first give an overview of related approaches and then detail our own. For the sake of simplicity, we set out with the vectorisation for single-centred molecules and then extend it to complex molecules.


Related Work Algorithmic optimisations of the linked-cells algorithm have already been sketched in Sec. 2.3, so we focus here on optimisations with respect to the implementation. Early work concentrates on the implementation on vector architectures such as CRAY or NEC machines. Schoen [87] describes an implementation for CRAY vector processing computers based on Verlet neighbour lists; Grest [33] combined that with linked cells. Everaers [26] improved on that with the Grid Search Algorithm, which uses a very fine grid to sort particles into and vectorises over that. Probably the most promising approach to the vectorisation of the linked-cells algorithm is the Layered Linked-Cells Algorithm described by Rapaport [75, 76], tuned to systems of several billion particles [77]. The fundamental idea is not to vectorise the inner-most loop over particles, but to vectorise the outer-most loop over cells, which is achieved by sorting particles into layers per cell. These layers are processed so that only disjunct particle pairs are created. This approach is most efficient for scenarios with a large cell count and an approximately equal particle count per cell. However, all these aforementioned approaches for vector computers heavily rely on gather/scatter instructions, which do not exist in the vector instruction set extensions of current CPUs. Rapaport compared his implementation for a Cray vector processor to an equally powerful Intel Xeon processor and found the performance of the vectorised version on both the Cray and the Intel Xeon inferior to the scalar version run on the Xeon [77]. Not too long ago, Benkert [7] evaluated a number of existing implementations for vector processing computers on a NEC SX-8, stating that "key problems are the complicated loop structure with nested if-clauses" as well as latencies due to indirect memory references. They conclude that "an improvement can only be achieved by developing new algorithms".

Since the introduction of vector instruction set extensions to commodity processors, efforts have been made to accelerate MD application software. E.g., GROMACS has been vectorised early using SSE [54] and has also been ported to the Cell processor by Olivier [65]. Peng [67] focused on hierarchical parallelisation using MPI, multithreading, and SIMD parallelisation. All these approaches vectorise over the three spatial dimensions of positions, velocities, forces, and so on. While this can be done automatically by the compiler, the theoretically possible speed-up is reduced from 4 to 3 in single precision and from 2 to 1.5 in double precision, respectively. Peng additionally applies zero-padding, which increases the required memory by 33 % and reduces the memory bandwidth. He stores particle data in one large array. Data is not re-sorted according to cell membership in the course of the simulation, which results in performance degradation due to irregular memory accesses. In the latest version of GROMACS, a new general approach to vectorisation has been implemented, targeting architectures with different vector lengths [72]. This technique potentially improves the performance of the vectorised implementation by gridding and binning: it tries to sort particles in a favourable way, which is important for the long neighbour lists handled on architectures with wide vector registers.


Fig. 18. AoS to SoA conversion: in order to allow for efficient vectorization, corresponding elements have to be stored contiguously for data streaming access.

Recently, also the implementation of short-range MD on the Intel MIC architecture has raised interest [21, 68], because its instruction set is similar to SSE and AVX, and additionally supports gather/scatter operations.

General Considerations when Vectorizing ls1 mardyn As shown in Fig. 8, ls1 mardyn is written in C++ and applies object-oriented design principles, with cells and particles being single entities. On the one hand, the object-oriented memory layout is cache-efficient by design because particles belonging to a cell are stored closely together. On the other hand, implementing particles as single entities results in array-of-structures (AoS) data structures that prevent easy vectorization, as discussed before.

Implementing a vectorized LJ-12-6 force calculation with AoS structures is nearly impossible since the elements are scattered across several cache lines, as shown in the upper part of Fig. 18. Only simple sub-parts of MD, which are memory-bound in general, such as updates of a single member or thermostats, do not suffer from such a memory layout. Here, the hardware prefetch logic loads only cache lines containing data which has to be modified. Taking into account that we need an entire sub data structure, e. g., positions, forces, etc. in all three spatial coordinates during the force calculation, a temporary structure of arrays (SoA) should be constructed in order to address cache line pollution and vectorization opportunities, as illustrated in the lower part of Fig. 18.

Enabling ls1 mardyn with such an SoA working buffer is straightforward. Even more important, our implementation matches ls1 mardyn’s C++-driven object-oriented software design. The original version of ls1 mardyn (see Fig. 18) handles all particle interactions on a particle level which, as stated above, prohibits vectorization. However, due to the nature of the linked-cells algorithm these particle interactions are always evaluated by iterating through the particles in two currently interacting cells. Therefore the concept of the CellProcessor is not only useful for memory optimization but it also allows efficient computation. Recall that the member function processCellPair can be used to implement the calculation of the LJ-12-6 potential forces (and of course also other potentials by other implementations) on a cell-pair basis and the members preProcessCell

and postProcessCell are used to prepare the memory. Precisely these two members
can be used to provide SoA working buffers on the fly. After a cell pair has been processed, the updated values are copied back into the original AoS data structures, which happens naturally when calling postProcessCell. Note that these additional copies do not matter in terms of complexity. Let us assume that both interacting cells contain m particles each. This means that we need O(m) cell-local copy operations for buffer handling, but the interaction itself requires O(m²) cell-local calculations, which is significantly higher.
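The following sketch illustrates how such an on-the-fly SoA working buffer can be attached to the pre-/post-processing hooks. It is a simplified illustration only: the class and member names (SoABuffer, loadFromCell, writeBackToCell) are hypothetical and do not reproduce the actual ls1 mardyn CellProcessor interface.

#include <cstddef>
#include <vector>

// Hypothetical AoS particle as stored inside a cell.
struct Particle { double x, y, z, fx, fy, fz; };

// Temporary SoA working buffer filled before and written back after a cell
// pair is processed. Copying costs O(m) per cell, while the force loop over a
// cell pair costs O(m^2), so the conversion overhead is asymptotically hidden.
struct SoABuffer {
    std::vector<double> x, y, z, fx, fy, fz;

    // Corresponds to the preparation done in preProcessCell.
    void loadFromCell(const std::vector<Particle>& cell) {
        const std::size_t n = cell.size();
        x.resize(n); y.resize(n); z.resize(n);
        fx.assign(n, 0.0); fy.assign(n, 0.0); fz.assign(n, 0.0);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] = cell[i].x; y[i] = cell[i].y; z[i] = cell[i].z;
        }
    }

    // Corresponds to the write-back done in postProcessCell.
    void writeBackToCell(std::vector<Particle>& cell) const {
        for (std::size_t i = 0; i < cell.size(); ++i) {
            cell[i].fx += fx[i]; cell[i].fy += fy[i]; cell[i].fz += fz[i];
        }
    }
};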

With this SIMD-friendly temporary data structure we will now cover the vectorization of the calculation of the LJ-12-6 potential force for particles with several sites. In addition to the LJ-12-6 potential force we also compute statistical measurements such as the virial pressure and the potential energy on the fly. These values are important for scientists to quickly decide whether the executed simulation yields reasonable results. From an implementation point of view they do not add further challenges and we will neglect them in the remainder of this chapter. The discussion of our vectorization splits into two sections. First, we describe how ls1 mardyn has been re-engineered to support standard x86 vector extensions such as SSE and AVX. Second, we give an outlook on using gather and scatter SIMD vector instructions, which will play an important role in emerging x86 processors such as the Xeon Phi. The work presented here extends prototype implementations published in [19–21].

Using Standard SIMD Vector Instruction Sets The most challenging part when vectorizing the LJ-12-6 force calculation between two multi-centered particles is the decision whether the forces should be calculated or not. This decision is made by comparing the distance of the particles instead of the centers within the particles, whereas the calculation of the force takes place between the centers. This requires a complicated, coupled vectorization approach, especially in scenarios with various particles and different numbers of centers. First we would have to calculate the distance between particles using a vectorization over particles and decide if the forces need to be computed. Afterwards we would have a SIMD vector register containing the decision. Complex unpacking routines of various length that handle the different numbers of centers are necessary for a vectorized center processing, as unnecessary LJ potential force calculations have to be masked out. Since we must execute this complex selection before starting the calculation of the force, it would be incurred for every center-center iteration.

Due to these huge overheads we vectorize the force calculation in a slightly different way, as shown in Alg. 1. In our implementation we construct a small look-up table, named m, on the fly for each particle i in the first cell interacting with all particles j from the second cell (the same holds true if interactions within a cell are computed). The length of the vector m corresponds to the number of centers in the second cell, and it contains, for each center of a particle j, the decision whether particle i interacts with that particle. Before continuing with the force calculation we check if an interaction is happening at all; otherwise we directly proceed with particle i + 1 of the first cell. Recorded traces unveiled that this happens in roughly 30% of all interactions. From the descriptions of

Algorithm 1 Schematic overview of the implementation of the LJ-12-6 potential force calculation used in ls1 mardyn.

1:  CP ← getCurrentCellPair()
2:  createSoA(CP.c1)
3:  createSoA(CP.c2)
4:  for all pi ∈ CP.c1 do
5:      for all pj ∈ CP.c2 do
6:          if getDistance(pi, pj) < rc then
7:              for c ∈ pj do mjc ← 1 end for
8:          else
9:              for c ∈ pj do mjc ← 0 end for
10:         end if
11:     end for
12:     if |m| = 0 then
13:         continue
14:     end if
15:     for all c ∈ pi do
16:         {This loop over all centers in c2 is vectorized}
17:         for all jc ∈ CP.c2.centers do
18:             if mjc = 1 then
19:                 calculateLJ(c, jc)
20:             end if
21:         end for
22:     end for
23: end for

Sec. 3.1 we can derive that such a preprocessing is much faster than the complex selection and unpacking routines discussed earlier.

Taking the preprocessed selection m as input, the vectorization of the LJ-12-6 potential force calculation is straightforward, as summarized in Fig. 19. For the force calculation we switch to a center-based processing. We load the first center of particle i in the first cell and four centers of particles in the second cell when using AVX. In case of SSE we can only load two centers due to the SIMD vector register width of 128 bits. Note that there is no requirement that all centers must belong to a specific set of particles. In addition to these four centers we load the corresponding four entries of m into p. If all four entries of p are zero we continue with the next four centers in the second cell. If instead at least one entry of p is 1, we perform the force calculation for all four centers, and interactions that are not needed are masked in the end to zero their contribution.
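To make this scheme more concrete, the following sketch shows how one AVX step of such a masked LJ-12-6 kernel could look in double precision. It is a simplified illustration under several assumptions (plain C arrays for the SoA buffers, a precomputed mask array m stored as 0.0/1.0 per center, ε and σ folded into the constants eps24 = 24ε and sig2 = σ², hypothetical function name lj_step_avx), not the actual ls1 mardyn kernel.

#include <immintrin.h>
#include <cstddef>

// One AVX step: one center (xa, ya, za) of a particle from the first cell
// interacts with four centers j..j+3 from the SoA buffer of the second cell.
// m holds 1.0 for centers whose parent particle is within the cutoff, else 0.0.
void lj_step_avx(double xa, double ya, double za,
                 const double* x, const double* y, const double* z,
                 const double* m, double* fx, double* fy, double* fz,
                 std::size_t j, double eps24, double sig2) {
    __m256d p = _mm256_loadu_pd(&m[j]);
    if (_mm256_movemask_pd(_mm256_cmp_pd(p, _mm256_setzero_pd(), _CMP_NEQ_OQ)) == 0)
        return;  // all four entries masked out: skip the whole computation

    __m256d dx = _mm256_sub_pd(_mm256_set1_pd(xa), _mm256_loadu_pd(&x[j]));
    __m256d dy = _mm256_sub_pd(_mm256_set1_pd(ya), _mm256_loadu_pd(&y[j]));
    __m256d dz = _mm256_sub_pd(_mm256_set1_pd(za), _mm256_loadu_pd(&z[j]));

    // r^2 = dx^2 + dy^2 + dz^2
    __m256d r2 = _mm256_add_pd(_mm256_mul_pd(dx, dx),
                 _mm256_add_pd(_mm256_mul_pd(dy, dy), _mm256_mul_pd(dz, dz)));

    // LJ-12-6: scale = 24*eps * (2*(sig^2/r^2)^6 - (sig^2/r^2)^3) / r^2
    __m256d lj2  = _mm256_div_pd(_mm256_set1_pd(sig2), r2);
    __m256d lj6  = _mm256_mul_pd(_mm256_mul_pd(lj2, lj2), lj2);
    __m256d lj12 = _mm256_mul_pd(lj6, lj6);
    __m256d scale = _mm256_div_pd(
        _mm256_mul_pd(_mm256_set1_pd(eps24),
                      _mm256_sub_pd(_mm256_add_pd(lj12, lj12), lj6)), r2);

    // Zero the contribution of masked-out centers, then accumulate the force
    // acting on the centers of the second cell (the equal and opposite
    // contribution to center a is omitted here for brevity).
    scale = _mm256_mul_pd(scale, p);
    _mm256_storeu_pd(&fx[j], _mm256_sub_pd(_mm256_loadu_pd(&fx[j]), _mm256_mul_pd(scale, dx)));
    _mm256_storeu_pd(&fy[j], _mm256_sub_pd(_mm256_loadu_pd(&fy[j]), _mm256_mul_pd(scale, dy)));
    _mm256_storeu_pd(&fz[j], _mm256_sub_pd(_mm256_loadu_pd(&fz[j]), _mm256_mul_pd(scale, dz)));
}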

This masking is the major weak point of the proposed vectorization approach. If we simulate two-center particles it might happen that, in case of an AVX vectorization, just half of the SIMD vector register is utilized since the second half is zeroed out. A possible solution is to sort the particles adequately to avoid such low vector loads, as described in [72]. Due to the lower particle counts in the linked-cells algorithm compared to the neighbor lists employed in [72], we cannot reproduce these enhancements. Another possibility is to rely on additional

Fig. 19. Kernel vectorization: the vectorization of the LJ-12-6 force calculation is optimized by duplicating one particle center in the first cell and streaming four other particle centers from the second cell.

hardware features. However, current x86 CPUs do not offer instructions which allow for a full SIMD vector register utilization. The required instructions are called gather and scatter. Gather and scatter enable a SIMD vector unit to load/store elements from/to different memory locations to/from a SIMD vector register. In contrast to standard x86 CPUs, the Intel Xeon Phi coprocessor implements both instructions, and we implemented a version of ls1 mardyn’s vectorized LJ-12-6 force calculation on this hardware. The implementation idea is described in the next section.

Fig. 20. Comparison of vector computation with SSE and AVX: (a) computation with SSE requires two distance computations but only one force computation; (b) computation with AVX requires one distance computation and one force computation.

Our approach to vectorisation works well for SSE [19] in double precision, i.e., with a vector length of two. When moving from SSE to AVX, performance does not double, as one would expect. This observation holds even more for the Intel Xeon Phi coprocessor with a vector width of eight. The reason can be explained with the help of Fig. 20. In that example, only one out of four possible interactions has to be computed. With SSE, see Fig. 20(a), the distance is computed for the first two particles, then the force computation is skipped, and distance and force computation for the last two particle pairs are executed. When the same situation is encountered with AVX, one distance computation and one force computation have to be performed. So in the case of AVX, only one comparably cheap distance computation is saved and no gain for the force computation is observed. On average only every fifth interaction has to be computed, so such cases happen frequently. Please note that we extended this scheme to also handle interactions caused by charges. This is straightforward and no further detailed

Fig. 21. Kernel vectorization utilizing gather and scatter instructions: instead of masking unnecessary force calculations we gather only “active” interaction centers from cell j and scatter the calculated forces back to cell j.

explanation is needed, as the only difference is the actual math for computing the potential itself.

Using Gather and Scatter on the Intel Xeon Phi coprocessor Replacing the discussed masking techniques with gather and scatter instructions appears to be straightforward at first sight. Instead of constructing a mask vector m, we create an offset vector o when calculating the distances between particles. o is afterwards used to load only those centers from the second cell which definitely interact with the center of the particle in the first cell. We sketch this principle in Fig. 21. As before, we load one center from cell i, which now interacts with eight centers from the second cell j. This is due to the increased SIMD vector width of the Intel Xeon Phi coprocessor. These eight centers are loaded by a gather instruction taking the offsets in o to skip particle centers which are excluded because of the cutoff constraint. This avoids unnecessary force calculations that would otherwise have to be masked out in the end and increases the utilization of the vector entries to 100%. Especially in case of Xeon Phi’s wider registers this is a very critical improvement. After the force calculation, o is used again during the scatter operation which stores the recently calculated forces back to cell j. Fig. 21 depicts a scenario with two- and three-center particles. This can be identified by sub-groups in the gathered entries. The first particle is a two-center one, the second has three centers and is followed by a two-center particle. The last entry of the SIMD vector register is filled by a single center which can either be part of a two- or three-center particle. The “missing” centers will be processed in the next call of the force calculation kernel.

While gather and scatter instructions perfectly match our requirements for the force calculation, the creation of the index vector o requires further assistance by hardware since our scenario has varying offsets. The first application many scientists think of when using gather instructions is a sparse matrix-vector multiplication. Here, the offsets are fixed and given by a constant sparsity pattern.

Fig. 22. Calculation of the offset vector o: the offset vector o is calculated by applying the mask vector m to an indexing structure.

Even when using adaptive mesh refinement (AMR), the matrix is created only once per iterative solver call, so the cost of generating the offsets vanishes and the offset generation can be implemented without focusing on performance. In our application this is significantly different. A gather and scatter offset vector is only valid for one force calculation. The leapfrog integration afterwards updates the positions of the particles, and in the next time step different interactions take place due to the new particle positions. In order to enable such applications to use gather and scatter instructions, Xeon Phi’s ISA includes a powerful helper instruction which saves huge overheads: vcompress. This instruction implements a masked store that only stores the marked entries of a SIMD vector register to the L1 cache. Figure 22 shows the usage of vcompress when creating o. The beginning is identical to the original version as we compute the mask vector m. Instead of keeping it for later reuse in the force calculation, we generate a regularly increasing index vector and call vcompress on it with m as input. This results in storing just those entries of the index vector that correspond to particle distances smaller than the cutoff radius rc. Similarly to computing the whole mask vector m in case of AVX/SSE, on Xeon Phi we create the entire offset vector o for one particle of cell i interacting with all particles of cell j.
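To illustrate the data flow, the following scalar sketch mimics what the mask-and-compress step computes (the hardware performs it with a vector compare producing m followed by vcompress; the function and variable names used here are hypothetical):

#include <cstddef>
#include <vector>

// Scalar emulation of the offset-vector construction: for one center of a
// particle in cell i, collect the indices of all centers in cell j whose
// parent particle lies within the cutoff radius. On Xeon Phi this loop body
// corresponds to a vector compare that produces the mask m, followed by a
// vcompress that packs the surviving indices into the offset vector o.
std::vector<int> buildOffsetVector(const std::vector<double>& dist2,            // squared particle distances
                                   const std::vector<int>& centerToParticle,    // parent particle per center of cell j
                                   double rc2) {
    std::vector<int> o;
    for (std::size_t jc = 0; jc < centerToParticle.size(); ++jc) {
        if (dist2[centerToParticle[jc]] < rc2) {
            o.push_back(static_cast<int>(jc));  // kept entry, later used by gather/scatter
        }
    }
    return o;
}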

4.3 Optimization Possibilities for Monatomic Fluids

A standard benchmark of MD codes is the simulation of a noble gas, since this boils down to a single LJ-12-6 center. In such a simulation even single-precision numbers are sufficient. We therefore forked a special version of ls1 mardyn which (1) uses single-precision numbers, (2) features a special force calculation that computes the masks p on the fly, and (3) includes a memory optimization for running very large simulations.

Running particle simulations which use all available memory has a long tradition. The work that is summarized on the next pages was first published in [20].

At the time of writing, an MD simulation of 4.125·10^12 atoms on nearly 150,000 cores constituted a new world record. This contribution was therefore honored with the PRACE ISC 2013 Award during the ISC 2013 conference in Leipzig, Germany, in June 2013, and continued a series of publications on extreme-scale MD. In 2000, [82] performed a simulation of 5·10^9 molecules, the largest simulation ever at that time. It was followed by [30, 48], holding the previous world record with 10^12 particles in 2008. These simulations demonstrated the state of the art on the one hand, and showed the scalability and performance of the respective codes on the other hand. More recent examples include the simulation of blood flow [73] as well as a force calculation of 3·10^12 particles [47], however without calculating particle trajectories. Finally, we have to note that our run is probably not the largest one anymore at the time of this writing (October 2013), as a small notice on the BlueWaters system indicates⁸. Here 7·10^12 particles were simulated using a comparable time integration as used in ls1 mardyn. However, there is no scientific publication on this work which would allow a detailed comparison. The notice shows that the researchers used a 2.5X bigger machine and executed a particle-in-cell (PIC) simulation and not an MD simulation.

Specialized Force Calculation for Single Center Particles For our single-center and single-precision specialized version, AVX128 instructions were employed so that we can run this implementation with the best possible performance on a wide range of processors. The calculation is therefore performed on four particles concurrently. We broadcast-load the required data of one atom into the first register (a), and the second register is filled with data from four other atoms (1, 2, 3 and 4), as depicted in Fig. 23. Instead of pre-calculating a mask vector m, we need to apply some pre- and post-processing by regular logical operations directly within the force computation kernel. As when computing m, it has to be determined if for any particle pair the distance is less than rc (pre-processing), because only then the force calculation has to be performed. If the force calculation has been done, the calculated results need to be zeroed by a mask for all particle pairs whose distance is greater than rc (post-processing). This optimization can be chosen since the number of particles per cell is equal to the number of interacting Lennard-Jones centers, which makes a blow-up of m unnecessary.
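As an illustration of this on-the-fly masking (again a sketch with hypothetical names, using 128-bit single-precision intrinsics in the spirit of the AVX128 variant described above), the distance comparison itself yields the mask that is later applied to the computed force terms:

#include <immintrin.h>

// Pre-processing: compute the cutoff mask for one atom (a) against four atoms
// of the second cell directly inside the kernel. dx, dy, dz hold the four
// distance components and rc2 is the squared cutoff radius. The returned
// register contains all-ones lanes for pairs within the cutoff, zero otherwise.
static inline __m128 cutoff_mask_ps(__m128 dx, __m128 dy, __m128 dz, float rc2) {
    __m128 r2 = _mm_add_ps(_mm_mul_ps(dx, dx),
                _mm_add_ps(_mm_mul_ps(dy, dy), _mm_mul_ps(dz, dz)));
    return _mm_cmplt_ps(r2, _mm_set1_ps(rc2));   // r^2 < rc^2
}

// Post-processing: zero the lanes of a computed force component whose particle
// pair lies outside the cutoff, using a plain bitwise AND with the mask.
static inline __m128 apply_mask_ps(__m128 force, __m128 mask) {
    return _mm_and_ps(force, mask);
}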

Optimizing the Memory Footprint of ls1 mardyn In order to achieve the lowest possible memory consumption, we reduced the size of a particle to 32 bytes (24 bytes for positions and velocities in x, y, z direction and an 8-byte identifier). Furthermore, we enhanced the linked-cells algorithm with a sliding window that was introduced in [22] and naturally matches our AoS to SoA conversions explained in Sec. 4.2. The sliding window idea is based on the observation that the access pattern of the cells acts like a spotlight that moves through the domain. At the moment the data of a cell is needed, the positions and velocities stored in an AoS manner

8 http://www.ncsa.illinois.edu/News/Stories/PFapps/

Fig. 23. Specialized kernel vectorization in case of single-center particles: if only single-center particles are simulated we can skip the calculation of m and use on-the-fly masks.

are converted to an SoA representation, while allocating additional space for storing the to-be-computed forces. To avoid the overhead of repeated memory (de-)allocations and the resulting page faults when converting a cell’s data structure, we implemented the conversion buffer as a global buffer which is only enlarged if required; otherwise it is reused for converting the data of the next cell after the previous cell has been successfully processed. Before the positions and velocities of the AoS structure can be updated and the working buffer can be released, the time integration has to be executed. This made a small change in ls1 mardyn necessary, since the time integration is now called on the fly directly after the force computation and not in a bulk manner on all cells as in the original version of ls1 mardyn.
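For concreteness, a 32-byte particle record matching the stated layout could look as follows (a sketch with hypothetical field names; the actual ls1 mardyn type is not shown here). With single-precision coordinates, six 4-byte values plus an 8-byte identifier add up to exactly 32 bytes.

#include <cstdint>

// Hypothetical compact particle record: 6 x 4-byte single-precision values
// for position and velocity plus an 8-byte identifier = 32 bytes in total.
struct CompactParticle {
    float x, y, z;      // position   (12 bytes)
    float vx, vy, vz;   // velocity   (12 bytes)
    std::uint64_t id;   // identifier ( 8 bytes)
};
static_assert(sizeof(CompactParticle) == 32, "expected a 32-byte particle record");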

Light-Weight Shared-Memory Parallelization for Hyperthreading The LJ-12-6 kernel is not well instruction-balanced, as we will discuss in detail in the performance results discussion, which impedes the use of the super-scalarity of a Xeon E5 core. In order to make use of the Xeon’s hardware thread concept, we implemented a lightweight shared-memory parallelization via OpenMP by extending the size of the sliding window as shown in Fig. 24. This allows two threads to perform calculations concurrently on independent cells. Since we only use two threads, we can still maintain Newton’s 3rd law: a barrier prevents the threads from working simultaneously on neighboring cells. Since the synchronization can be handled within the L1 cache, its overhead is negligible. This allows the execution of one MPI rank per Xeon E5 core with two OpenMP threads exhibiting sufficient ILP, leading to a 12% performance improvement.
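A minimal sketch of such a barrier-separated two-thread traversal is given below. It is an illustration only: processCell is a placeholder for the vectorized per-cell force computation, and the window is assumed to contain an even number of cells split into two disjoint halves, as in Fig. 24.

#include <omp.h>
#include <cstdio>
#include <vector>

// Placeholder for the per-cell work (in the actual application: the force
// computation for one cell and its forward neighbors).
void processCell(int cellIndex) {
    std::printf("processing cell %d\n", cellIndex);
}

// Two threads work on disjoint halves of the cells covered by the sliding
// window; the barrier after each cell keeps them from touching neighboring
// cells at the same time, so Newton's 3rd law can still be exploited.
void processWindow(const std::vector<int>& windowCells) {
    const int half = static_cast<int>(windowCells.size()) / 2;
    #pragma omp parallel num_threads(2)
    {
        const int tid = omp_get_thread_num();
        for (int i = 0; i < half; ++i) {
            processCell(windowCells[tid * half + i]);
            #pragma omp barrier   // synchronize before both threads advance
        }
    }
}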

Fig. 24. Sliding window with support for multi-threading: by choosing a window of 5 cells, two threads can independently work on three cells each: thread 1 works on cells 13, 14, 15; thread 2 works on cells 16, 17, 18. To avoid that the threads work on the same cells (e. g., thread 1 on the cell pair 15–25, thread 2 on 16–25), a barrier is required after each thread has finished its first cell.

5 Experiments

The performance evaluation is carried out in three parts. We start in the next paragraph by showing performance results on the Intel Sandy Bridge architecture for scenarios containing particles with one to four centers. In all cases we ran strong-scaling and weak-scaling scenarios on SuperMUC with a cutoff radius of rc = 3.8σ, and analyze the performance characteristics of the implementation in depth. We choose this relatively small cutoff (which leads to just a couple of particles per cell) since it is representative for chemical engineering applications. The particles with just one LJ-12-6 center are the noble gas argon, ethane in case of two centers, CO2 for three centers, and acetone for four centers, respectively. Here we note that argon as an atomic fluid represents an extreme case for our implementation, which is specialized for multi-centered molecules. Since water is one of the most important substances on earth and important beyond the field of chemical engineering, we also evaluate our implementation at the example of water. Here, we use the TIP4P model, consisting of three charge sites and one LJ-12-6 interaction site.

In the following section, we study the performance of the hybrid parallelization on the Intel Xeon Phi coprocessor, as well as its scalability across nodes. Special focus is put on the analysis of the performance of the proposed gather and scatter enhanced force calculation on the Intel Xeon Phi coprocessor.

Finally, we analyze our implementation specialized for atomic fluids, e.g. targeting inert gases, and describe the performance study executed on SuperMUC. Anticipating the evaluation, that implementation allows us to efficiently use the entire machine, enabling the world’s largest molecular dynamics simulation in 2013.

5.1 Performance on SuperMUC

In order to analyze the performance of our implementation on standard cluster hardware, we conducted the strong- and weak-scaling experiments for up to 16,384 cores of SuperMUC, using 16 MPI ranks per node, discussed next. Following that, we study the performance in dependence of two further important parameters, the number of molecules per process and the influence of the cut-off radius.

Strong Scaling Experiments Figure 25 shows the obtained runtimes for strong-scaling scenarios with N = 1.07·10^7 particles each, i.e. on 16,384 cores the molecule number is as low as 650 molecules per core. We compare our recently presented vectorization approach to the original version of ls1 mardyn utilizing the mentioned ParticlePairsHandler. In all measurement points of Fig. 25 we are able to clearly outperform this version of ls1 mardyn, although we have to point out that the margin becomes smaller when scaling out to all 16,384 cores. This is mainly due to the fact that the particle count per core (≈ 600) becomes so small that communication and boundary handling consume more computing time than computing the actual particle interactions. Furthermore, we can recognize the expected strong-scaling behavior: the costlier a force calculation is, the longer the runtime scaling plot exhibits an ideal shape, since communication and boundary handling play a minor role. Consequently, best scalability is achieved for the computationally more complex TIP4P water model, achieving a parallel efficiency of nearly 50 % on 512 nodes, i.e. 8,192 cores, in comparison to a single node. The measurements shown in Fig. 25 were taken by using two or more islands of SuperMUC for more than 2,048 cores. Therefore, the well-known inter-island kink can already be seen when using 4,096 cores. For all scenarios the runtime per iteration on 256 cores is below 1 second, allowing for large-scale production simulations at a good parallel efficiency of 80 %.

However, these plots do not allow an in-depth performance comparison since the plotted numbers are spread across four orders of magnitude. In order to get a deeper understanding of ls1 mardyn’s performance we created Fig. 26. This diagram is based on GFLOPS measurements performed simultaneously to the runtime measurements of Fig. 25. Plotting just GFLOPS instead of runtime would not gain any new insight, and we therefore normalized the obtained GFLOPS with the peak GFLOPS of the used number of cores. This directly emphasizes the parallel efficiency of ls1 mardyn and the impact of using a vectorized force calculation.

There are three observations. First, the speed-up from using a vectorized force calculation is 2X-3X depending on the executed scenario. Reasons why the theoretically available speed-up of 4X is not achieved will be discussed soon. Second, independent of using the classic or the vectorized version of ls1 mardyn, going from 1 to 16,384 cores we measured a parallel efficiency of roughly 50% for strong scaling. Finally, and this comes in conjunction with the first point discussed, although we efficiently vectorize the force calculation we are still at only 10% peak efficiency.

Fig. 25. Runtime of the strong-scaling benchmark scenarios on SuperMUC using 1 to 16,384 cores (runtime [s] per 100 iterations over #cores for the classic version, the vectorized version, and ideal scaling of the vectorized version): (a) single-center argon, (b) two-center ethane, (c) three-center CO2, (d) four-center acetone, (e) TIP4P water model.

Fig. 26. Achieved peak performance: for each measurement point of Fig. 25 we additionally recorded the obtained GFLOPS and calculated the corresponding fraction of peak performance (peak efficiency over #cores for the classic and vectorized versions of argon, ethane, CO2, acetone, and TIP4P water).

                          Argon  Ethane  CO2   Acetone
AVX vector register load   60%    64%    65%    100%

Table 2. SIMD vector register utilization during the force calculation on SuperMUC for all four scenarios. The remaining interactions are masked.

Let us start with the seemingly too small vectorization speed-up: since the classic, scalar version runs at roughly 4% of peak performance, an initial guess would be that the vectorized version should reach 16%. Due to the small cutoff radius only 20-40 particles are stored in one cell. This leads to only 5-10 calls of our kernel, with huge amounts of the SIMD vector register being masked [21]. Some extra experiments showed that with cutoffs rc > 5σ efficiencies close to 16% are possible. But still, even 16% appears to be too low at first sight. However, this low number can be explained by the involved instructions and the dependencies between instructions when computing particle interactions. Recalling Eq. (2), we see that multiplications dominate the operation mix and even a division is needed. With the earlier discussed microarchitectures in mind, we know that modern super-scalar processors feature multiplication and addition units and implement ILP. Since we stress just one of them, our achievable peak performance is limited to 50% upfront. Furthermore, from datasheets we know that the required division costs roughly 40 clock cycles. This is more than the rest of the interaction computation, so we can halve the achievable peak performance again and end up with a number between 20% and 25%. This is exactly the performance we measured for bigger cutoff radii. Finally, we want to discuss why the performance of simulating acetone is significantly better than in the other three test scenarios. This circumstance becomes immediately clear when comparing the vector width of AVX with the number of centers acetone has: four meets four. In this scenario no unnecessary force calculations take place since a force calculation is never masked, which leads to a 2% higher efficiency. The SIMD vector register utilizations obtained in all four scenarios are summarized in Tab. 2. These results confirm the measured peak efficiencies from a different point of view: since the one- to three-center scenarios have roughly the same SIMD vector register utilization, similar performance can be achieved.
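This bound can be written compactly as a rough back-of-the-envelope estimate derived from the instruction mix stated above (not an exact microarchitectural model):

\[
\text{achievable peak fraction} \;\lesssim\;
\underbrace{\tfrac{1}{2}}_{\text{only the multiply unit is saturated}}
\times
\underbrace{\tfrac{1}{2}}_{\text{division latency} \,\approx\, \text{remaining kernel work}}
\;=\; 25\,\%.
\]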

These comments also explain the comparably low performance of the TIP4P water. First, the performance of the Lennard-Jones kernel is higher than that of the charge kernel. This is due to the lower arithmetic intensity of the charge kernel: while a comparable amount of data has to be loaded, fewer arithmetic operations are performed. Among these instructions is the expensive square root, which is characterized by high latency and introduces pipeline stalls. Second, the model features one Lennard-Jones center and three charge centers, which do not perfectly fit the width of the vector register.

Fig. 27. Runtime of the weak-scaling benchmark scenarios on SuperMUC using 1 to 16,384 cores (runtime [s] per 100 iterations over #cores for the classic version, the vectorized version, and ideal scaling of the vectorized version): (a) single-center argon, (b) two-center ethane, (c) three-center CO2, (d) four-center acetone, (e) TIP4P water model.

Fig. 28. Achieved peak performance: for each measurement point of Fig. 27 we additionally recorded the obtained GFLOPS and calculated the corresponding fraction of peak performance (peak efficiency over #cores for the classic and vectorized versions of argon, ethane, CO2, acetone, and TIP4P water).

Weak Scaling Experiments For the weak-scaling run, 20,500 molecules per process have been used, resulting in 328,000 molecules per node on SuperMUC. Runtimes for the five test fluids are shown in Fig. 27 and peak efficiencies in Fig. 28. As can be expected for MD codes in general, ls1 mardyn also exhibits very good weak-scaling behaviour. For all test fluids, a parallel efficiency of 75% or higher can be stated. In the peak efficiency, slight oscillations can be recognized for process counts of 512 and 4,096, where the domain decomposition exhibits a symmetric layout of 8x8x8 and 16x16x16 processes. Then, in comparison with other process numbers, the communication overhead is minimized. Apart from that, the observations already stated for the strong scaling can be confirmed.

Node-Level Performance in Dependence of the Number of Molecules. In order to better understand the behaviour of the implementation, the node-level performance was studied for varying numbers of molecules from 50,000 to 12 million particles for 1CLJ and 4CLJ at a fixed cut-off radius of rc = 3.8σ. Fig. 29 visualises FLOP rate and time per iteration on a SuperMUC node clocked at 2.6 GHz. Again it can be seen that the performance of the scalar version is almost tripled by the AVX double-precision version. For larger molecule numbers, performance fluctuates around a fixed value and runtime correspondingly grows linearly in the number of molecules. Only for small molecule numbers is performance significantly higher, dropping for growing molecule numbers. This might be due to cache effects. Only the AVX128 single-precision version, which will be further investigated in the upcoming section, is different, as it shows increasing performance with higher molecule numbers. However, its performance also saturates at roughly 70 GFLOPS for large particle numbers.

Fig. 29. Performance in dependence on the number of molecules at rc = 3.8σ: (a) runtime in seconds per iteration and (b) GFLOPS, both depending on the number of molecules in the system, for argon (classic, vectorized, vectorized SP) and acetone (classic, vectorized).

Node-Level Performance in Dependence of the Cut-Off Radius. A similar study has been performed for test fluids composed of 10.7 million 1CLJ and 4CLJ molecules, where we assess the dependence on the cut-off radius, i.e. the number of molecules per cell. FLOP rate and runtime are shown in Fig. 30. The hardware performance for the scalar versions shows only a weak increase and quickly saturates, especially in the case of the computationally expensive four-centered molecules. Consequently, the runtime grows quadratically just as the number of computations does. In contrast, the performance of the vectorized implementations grows strongly with increasing cut-off radius and saturates at a much higher level. Zooming in, it can be seen that this increase in hardware performance is so strong that it totally compensates for the quadratically growing number of computations and results in the lowest runtimes for cut-off radii around rc = 3.0σ. Although the implementation cannot escape the algorithmic asymptotic complexity, the point where asymptotic growth starts is shifted towards higher cut-off radii. This effect is strongest for the AVX128 SP version. We note that such a behavior is favorable for applications in chemical engineering, where often larger cut-off radii are employed, which allow for simulations with higher precision.

Fig. 30. Performance in dependence on the cut-off radius for 10.8 million molecules: (a) runtime in seconds per iteration and (b) GFLOPS, both depending on the cut-off radius (1.2 to 7.5), for argon (classic, vectorized, vectorized SP) and acetone (classic, vectorized).

5.2 Performance on the Intel Xeon Phi coprocessor

Finally, we evaluate the performance of the implementation on the Intel Xeon Phi coprocessor, using a hybrid parallelization of OpenMP and MPI. First, we focus especially on the force calculation that employs the gather and scatter instructions of the coprocessor. Here, we note that the compared versions all neglect Newton’s third law, i.e. still better performance might be achieved by saving half of the interaction computations, see the discussion in Sec. 3.2. Fig. 31(a) compares different ls1 mardyn Xeon Phi derivatives to the recently discussed vectorized version of ls1 mardyn running on a dual-socket Xeon E5 server, such as a
SuperMUC node. Results were measured by running only the force calculation for smaller scenarios (N = 1.3·10^6 particles).

Fig. 31. Performance analysis of ls1 mardyn’s force calculation on the Intel Xeon Phi 5110P coprocessor. (a) Achieved performance of ls1 mardyn on the Intel Xeon Phi coprocessor (runtime [s] of the force calculation for argon, ethane, CO2 and acetone; compared configurations: Xeon E5-2670 with 16 ranks, classic and vectorized; Xeon Phi 5110P with 120 ranks using masking, 120 ranks using gather/scatter, and 240 threads using gather/scatter): roughly the performance of a dual-socket Xeon E5 server can be obtained by one coprocessor. (b) Impact of using gather and scatter (breakdown into initialization, SoA -> AoS, AoS -> SoA, interaction, and distance phases): the time for computing particle interactions significantly decreases, but the distance calculation becomes more expensive due to the vcompress instruction.

The red bar is a direct port of the AVX version to the doubled vector width of the Xeon Phi coprocessor, using the masking approach introduced in Alg. 1 and running purely MPI including Newton’s 3rd law. Right next to this bar, the green bar incorporates only one change: it uses gather and scatter instead of masking unnecessary force calculations. Finally, the purple bar replaces the MPI parallelization by a shared-memory parallelization running one process per
coprocessor with 240 OpenMP threads. Since a huge amount of boundary and halo computations can be avoided, this version outperforms all other Xeon Phi versions and even slightly the Xeon E5 server, although we neglect the Newton-three optimization. These results fit well with other Xeon Phi speed-ups reported for simulating single-center molecules [68], where a speed-up of 1.4X by using the Xeon Phi coprocessors is reported. This is consistent with our numbers once the speed-up is scaled, as a Xeon E5 with 400 MHz lower frequency and a Xeon Phi coprocessor with 250 MHz higher frequency were used for the measurements in [68].

Unfortunately, especially for one- and two-center molecules the improvement due to gather and scatter instructions is rather small. Therefore, we performed an in-depth analysis whose results are shown in Fig. 31(b). We observe that in all cases the cycles spent in the interaction calculation itself can be significantly reduced. However, the costs of the distance calculation increase by more than a factor of two. Further analysis unveiled that the “problematic” part is the vcompress instruction. This is due to unaligned accesses to the L1 cache, as fewer than eight entries are stored back and the starting address is in general not cache line aligned. Nevertheless, vcompress is very valuable, as an implementation in software would add another factor of 1.5X to the distance calculation, which would result in lower performance when using gather and scatter.

The overall performance of ls1 mardyn on the Xeon Phi is compared in Fig. 32. Fig. 32(a) compares the performance of our purely shared-memory parallelized implementation on one Intel Xeon Phi coprocessor to the classic and the hybrid variant on one Intel Xeon E5 node. It can be clearly seen that the proposed implementation delivers the same performance on one Xeon Phi coprocessor as on two Intel Xeon E5 CPUs. Here we furthermore note the influence of an efficient vectorization: while vectorization gains a speed-up of 2–3X on the Xeon, it is crucial on the Xeon Phi with a speed-up of nearly 10X.

Scalability from one to four nodes is shown in Fig. 32(b) for 1.6 million molecules of our four test fluids composed of Lennard-Jones sites. One Xeon Phi per node is running 16 MPI ranks with 15 threads each, while the host system is running 16 MPI ranks with 2 threads each. The overall picture is similar to the scaling experiments on SuperMUC: good overall scalability is achieved, which is better for the computationally more expensive molecule types. These tests show that the implementation proposed in this work is suitable to efficiently utilize heterogeneous compute clusters. The symmetric usage of the Intel Xeon Phi proves especially convenient: since we achieve approximately the same performance on one coprocessor as on one host node, we can utilize both host and accelerator at the same time, something which would not be easily possible in the offload model.

5.3 Multi-Trillion Particle Simulations

Preceding publications [30, 48, 82] used cutoff radii within the interval 2.5σ < rc < 5.0σ. We already discussed in the case of multi-center particles that the used cutoff radius has a strong influence on the obtained flop rate. For this reason we executed our specialized version of ls1 mardyn on eight SuperMUC nodes

Fig. 32. Performance analysis of ls1 mardyn’s overall performance in a heterogeneous cluster with Intel Xeon Phi coprocessors and regular Intel Xeon nodes. (a) Performance comparison of one Intel Xeon Phi to a dual-socket Intel Xeon E5-2670 node ([s] per iteration for argon, ethane, CO2 and acetone; configurations: Xeon classic with 32 ranks, Xeon Phi classic with 120 ranks, Xeon hybrid with 8x4 ranks, Xeon Phi SHM with 240 threads). (b) Scalability across 1 to 4 nodes: runtimes of 1.6 million molecules at rc = 7.0σ; one Xeon Phi card per node is used with 16 MPI ranks and 15 threads, while the host system is running 16 MPI ranks with 2 threads each.

and ran scenarios ranging from 5 to 500 million atoms of the liquid noble gas krypton. The measured performance is shown in Fig. 33. We see that the size of the scenario has only a minimal impact on the performance if it is not chosen too small. In contrast to this, doubling the cutoff radius from 2.5σ to 5.0σ boosts the simulation by more than a factor of two. Increasing it to just 3.5σ gives 60% more performance. Slightly less than 50% improvement is gained when comparing a cutoff of 3.5σ to rc = 5.0σ. In order to highlight these differences we decided to run several scenarios. First, we cover mid-sized scaling tests on up to 32,768 cores with rc = 5.0σ on SuperMUC. Second, we scale out to the full machine and restrict ourselves to the smaller cutoff radius rc = 3.5σ.

In case of the “small” benchmark runs we used N = 9.5·10^8 particles for strong-scaling tests. Such a simulation requires at least two nodes of SuperMUC, as it consumes roughly 36 GB of memory for particle data. N was chosen slightly higher than in the multi-center case as we process single-center particles and scale to twice as many cores.

Fig. 33. Influence of the cutoff radius rc on the obtained performance: GFLOPS depending on particle count (5 to 505 million particles) and rc ∈ {2.5σ, 3.5σ, 5.0σ} on 128 SuperMUC cores.

scale to twice as many cores. Additionally, we performed a weak-scaling analysiswith N = 1.6·107 particles per node which results in a simulation of N = 3.3·1011

particles on 2,048 nodes.When running the full-machine benchmark on SuperMUC we increased the

strong-scaling scenario to N = 4.8 · 109 particles, which fits on eight nodesoccupying 18 GB per node. Moreover, we performed a weak-scaling analysiswhich is scaled to the full SuperMUC machine. Due to MPI buffers on all nodes,we could only pack N = 4.52 · 108 on each node. Particularly, buffers for eagercommunication turned out to be the most limiting factor. Although we reducedthem to a bare minimum (64 MB buffer space for each process), roughly 1 GBper node had to be reserved as we used one MPI rank per SuperMUC core.

The SuperMUC results of the small benchmark, depicted in Fig. 34, can be discussed rather quickly. In terms of parallel efficiency SuperMUC achieves an excellent value of 98% in the weak-scaling scenario. Running on 32,768 cores (using 65,536 threads), the simulation achieves 183 TFLOPS. A slightly different picture is given by the strong-scaling numbers, as SuperMUC’s parallel efficiency decreases to 53%, which corresponds to 113 TFLOPS. As discussed earlier, this lower scalability is due to SuperMUC’s network topology.

Finally, Fig. 35 shows that nearly perfect scaling was achieved for up to 146,016 cores using 292,032 threads, in both weak- and strong-scaling scenarios. These jobs nearly used the full machine, which has 147,456 cores. In case of strong scaling, due to exclusive use of the whole machine, a very good parallel efficiency of 42% comparing 128 to 146,016 cores was measured. In this case, less than 20 MB (5.2·10^5 particles) of main memory per node, which basically fits into the processors’ caches, were used. This excellent scaling behavior can be explained by analyzing Fig. 33. Already for N = 3·10^8 particles (approx. 8% of the available memory) we are able to hit a performance of roughly 550 GFLOPS, which we also obtained for N = 4.8·10^9. It should be pointed out that the performance shows only a small decrease for systems containing fewer particles

Fig. 34. Strong and weak scaling of our single-center optimized ls1 mardyn version on SuperMUC using up to 32,768 cores with rc = 5.0σ: (a) strong- and weak-scaling performance (TFLOPS over #cores, with ideal scaling for reference), (b) achieved peak performance efficiency over #cores.

Fig. 35. Strong and weak scaling of our single-center optimized ls1 mardyn version on SuperMUC using up to 146,016 cores with rc = 3.5σ: (a) strong- and weak-scaling performance (TFLOPS over #cores, with ideal scaling for reference), (b) achieved peak performance efficiency over #cores.

(reducing the particle system size by a factor of 100). For N = 10^7 we see a drop by 27%, which only increases to the above-mentioned 58% when moving from 128 to 146,016 cores. The overall simulation time in this case was 1.5 s for 10 time steps, out of which 0.43 s were communication time, accounting for 29% of the 1.5 s overall runtime.

Moreover, we performed a weak-scaling analysis with 4.125·10^12 particles, one time step taking roughly 40 s on 146,016 cores. This scenario occupies the volume of a cube with an edge length of 6.3 micrometers and is therefore nearly visible. With simulations of that size, the direct comparison of lab experiments and numerical simulations, both in the same order of magnitude, will soon be within reach. For the largest run, a parallel efficiency of 91.2% compared to a single core with an absolute performance of 591.2 TFLOPS was achieved, which corresponds to 9.4% peak performance efficiency. As discussed earlier, the overall lower peak performance efficiencies are a result of the small cutoff radius rc = 3.5σ.

5.4 Summary

In this section we thoroughly evaluated the different aspects of the proposed implementations. We demonstrated how a multi-center molecular dynamics application targeting chemical engineering applications, ls1 mardyn, can be accelerated by leveraging the data-parallel SIMD vector instructions of modern computing devices. Depending on the executed MD scenario, a time-to-solution speed-up of up to 3X was achieved in case of large-scale simulations. Even in situations that are complicated to vectorize, our enhanced version of ls1 mardyn runs at least two times faster. Furthermore, we evaluated emerging many-core and SIMD vector platforms featuring complex gather and scatter instructions by taking the Intel Xeon Phi coprocessor as a proxy. Using gather and scatter instructions significantly reduced the time for computing the interactions of particles. This has to be paid for by a more complicated distance calculation combined with the generation of the gather and scatter offset vectors. Besides, we showed that the OpenMP-based parallelisation of the full simulation on Xeon Phi is fully functional, and allows the efficient hybrid execution across multiple nodes using both the host system and the coprocessor.

On standard and highly optimized multi-core systems such as SuperMUC we achieve close to 20% peak performance, which constitutes a very good result. This is due to three circumstances caused by the selected Lennard-Jones-12-6 potential for the force calculation. First, it is not very well instruction-balanced, as it requires significantly more multiplications than additions, which limits ILP. Second, a division is required which consumes more cycles than the rest of the force computation. Third, the kernel exposes many instruction dependencies that limit instruction pipelining. The same holds for the arithmetically even less intense kernel for Coulomb interactions. Considering these obstacles, we conclude that the derived implementations can be considered optimal on current compute devices. Finally, we showcased that ls1 mardyn, and therefore MD in general, is able to unleash the compute power of modern multi-petaflops supercomputers as it scaled with 92% efficiency at 600 TFLOPS performance to the full SuperMUC machine.

6 Conclusion

In this book we described the current state of our work on the optimization of molecular dynamics simulations. In particular, we motivated the development of a code specialized for its area of application, here the field of large-scale simulations in chemical engineering. We demonstrated how the molecular dynamics application ls1 mardyn has been accelerated on various platforms by leveraging the data-parallel SIMD vector instructions of modern computing devices. Furthermore, we evaluated emerging many-core and SIMD vector platforms featuring complex gather and scatter instructions, taking the Intel Xeon Phi coprocessor as a proxy. The basis of that assessment has been a rigorous hardware-aware re-engineering based on today's and most likely also future hardware characteristics, carved out in Sect. 3.1, as well as the resulting software design principles. Since vectorization and parallelization had to be implemented explicitly, their smooth software-technical integration with the existing code has also been a focus of our work.

The sliding window traversal, one of the main contributions of this work, forms the basis for the memory- and runtime-efficient implementation of an MD simulation based on the linked-cells algorithm. For the examples considered there, memory consumption could be reduced by a factor of more than four. The linked-cells algorithm, as the core algorithm of many MD simulation packages, has been tuned to the SSE/AVX vector instruction set extensions of current CPUs. Highly optimized kernels have been developed for the computation of the Lennard-Jones potential and the Coulomb potential. Depending on the executed MD scenario, a time-to-solution speed-up of up to 3X was achieved for large-scale simulations. Even in scenarios that are difficult to vectorize, our enhanced version of ls1 mardyn runs at least two times faster.
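As a reminder of the data structure these optimizations build on, the minimal C++ sketch below shows the essence of linked-cells binning: with a cell edge length of at least r_c, all interaction partners of a particle reside in its own or the directly adjacent cells, and the sliding window then traverses these cells slab by slab. The code is a generic illustration under assumed names, not the ls1 mardyn implementation.

// Minimal linked-cells binning (generic sketch; cell edge length >= r_c).
#include <cstddef>
#include <vector>

struct Particle { double x, y, z; };

struct LinkedCells {
    int nx, ny, nz;                       // cells per dimension (cubic box assumed)
    double cellLen;                       // >= r_c by construction
    std::vector<std::vector<int>> cells;  // particle indices stored per cell

    LinkedCells(double boxLen, double rc)
        : nx(static_cast<int>(boxLen / rc)), ny(nx), nz(nx),
          cellLen(boxLen / nx),
          cells(static_cast<std::size_t>(nx) * ny * nz) {}

    // Map a particle to its cell; the modulo also folds the periodic boundary.
    int index(const Particle& p) const {
        const int ix = static_cast<int>(p.x / cellLen) % nx;
        const int iy = static_cast<int>(p.y / cellLen) % ny;
        const int iz = static_cast<int>(p.z / cellLen) % nz;
        return (iz * ny + iy) * nx + ix;
    }

    void insert(const std::vector<Particle>& particles) {
        for (std::size_t i = 0; i < particles.size(); ++i)
            cells[static_cast<std::size_t>(index(particles[i]))]
                .push_back(static_cast<int>(i));
    }
};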

Weak and strong scalability of the original and the optimized production version has been assessed on the Intel-based IBM System x iDataPlex cluster SuperMUC, located at the Leibniz Supercomputing Centre in Munich, and generally very good scalability could be proven. This also holds for strong scaling, which is of special practical importance for users: good scaling behavior could be observed down to 650 molecules per core on 16,384 cores in total. Further investigation of the performance of the vectorized version reveals that performance tends to increase for larger cut-off radii and particle numbers. This is a pleasant fact, as larger particle numbers and cut-off radii occur frequently in chemical engineering scenarios. These results confirm the decision to develop a code specialized on rigid-body MD for applications in chemical engineering. In addition, the highly specialized version for inert fluids, featuring vectorization, a light-weight shared-memory parallelization, and memory efficiency, has been benchmarked. These experimental results impressively demonstrate the potential that can be unleashed by an optimal implementation on state-of-the-art hardware.

Making use of the same design principles and the same software layout, which allowed the seamless integration of a different target platform, we demonstrated the potential of ls1 mardyn on the Intel Xeon Phi coprocessor. Here, an efficient vectorization is crucial to obtain good single-core performance, and in order to efficiently utilize the full coprocessor, a highly scalable shared-memory parallelization is indispensable. With the described implementation we measured approximately the same performance for one Xeon Phi card as for two Sandy Bridge processors, which is a very good result. We would also like to point out that a fully optimized, load-balanced distributed-memory parallelization featuring Xeon Phi coprocessors would deliver the same performance as roughly 250-280 SuperMUC nodes running ls1 mardyn's original implementation. Specifically, SuperMUC will feature a partition equipped with Intel Xeon Phi coprocessors in its installation phase 2; thus ls1 mardyn will also allow the efficient usage of modern cluster systems in the near future.
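A minimal sketch of the kind of shared-memory parallelization referred to here, assuming an OpenMP loop over precomputed cell pairs (the names and the scheduling choice are illustrative assumptions, not the exact ls1 mardyn scheme): the coprocessor only reaches its potential when a few hundred hardware threads are kept busy, and write conflicts between pairs sharing a cell must be avoided, e.g. by cell coloring or thread-private force buffers.

// Illustrative OpenMP traversal over cell pairs (compile with -fopenmp).
// processCellPair computes the forces between the particles of two cells.
#include <utility>
#include <vector>

void traverse_cell_pairs(const std::vector<std::pair<int, int>>& pairs,
                         void (*processCellPair)(int, int)) {
    // Dynamic scheduling mitigates load imbalance between differently
    // populated cells; correctness requires that concurrently processed
    // pairs do not write to the same cell (ensured e.g. by coloring).
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(pairs.size()); ++i)
        processCellPair(pairs[i].first, pairs[i].second);
}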

The strict application of the concepts for memory and runtime efficiency allowed us to perform the world's largest molecular dynamics simulation to date. Pushing the memory required per molecule down to only 32 bytes, 4.125 · 10^12 molecules have been simulated using 146,016 cores of SuperMUC. That simulation achieved 591.2 TFLOPS in single precision, i.e., a peak efficiency of 9.4% at a parallel efficiency of 86.3% compared to one node. That run impressively demonstrates that our work did not only contribute to, but defined, the state of the art in MD simulation.
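To illustrate what a 32-byte budget per molecule implies, the sketch below shows one plausible single-precision record layout that fits the budget, together with the resulting aggregate footprint for 4.125 · 10^12 molecules; the concrete field choice is an assumption for illustration and not necessarily the layout used in ls1 mardyn.

// Illustrative 32-byte molecule record (assumed layout, single precision).
#include <cstdint>
#include <cstdio>

struct MoleculeSP {
    float x, y, z;     // position, 12 bytes
    float vx, vy, vz;  // velocity, 12 bytes
    std::uint64_t id;  // global identifier (or padding), 8 bytes
};                     // 32 bytes in total

static_assert(sizeof(MoleculeSP) == 32, "record exceeds the 32-byte budget");

int main() {
    const double nMolecules = 4.125e12;
    const double bytes = nMolecules * sizeof(MoleculeSP);
    std::printf("raw particle data: %.0f TB\n", bytes / 1e12);  // ~132 TB
    return 0;
}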

It has been stated in the introduction that both the implementations of algorithms and the underlying algorithms themselves need to be adapted to current hardware. In this work, efficient algorithms have been adapted and tuned to the best available hardware. In doing so, the simulation code ls1 mardyn has been improved considerably. Additionally, the experiences gained throughout this work, the methodological achievements, and the derived implementations help to advance the field of molecular dynamics simulation, especially in chemical engineering, beyond a single code.

Optimization Notice: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.
