Parallel Molecular Dynamics Application Oriented Computer Science Research Laxmikant Kale .

Parallel Molecular Dynamics

Application Oriented

Computer Science Research

Laxmikant Kale

http://charm.cs.uiuc.edu

Outline

• What is needed for HPC to succeed?

• Parallelization of Molecular Dynamics– Aggressive Parallel decomposition– Load Balancing and performance– Multiparadigm programming

• Collaborative Interdisciplinary Research– Comments and lessons

Contributors

• PI s : – Laxmikant Kale, Klaus Schulten, Robert Skeel

• NAMD 1: – Robert Brunner, Andrew Dalke, Attila Gursoy,

Bill Humphrey, Mark Nelson

• NAMD2:– M. Bhandarkar, R. Brunner, A. Gursoy, J. Philips,

N.Krawetz, A. Shinozaki, K. Varadarajan,

Parallel Computing Research

• Trends: – application centered CS research– Isolated CS research

• Drawback of both

• Needed:– Computer Science centered, yet application

oriented research

Middle layers

Applications

Parallel Machines

“Middle Layers”:Languages, Tools, Libraries

Molecular Dynamics• Collection of [charged] atoms, with bonds

• Newtonian mechanics

• At each time-step– Calculate forces on each atom

• bonds:

• non-bonded: electrostatic and van der Waal’s

– Calculate velocities and Advance positions

• 1 femtosecond time-step, millions needed!

• Thousands of atoms (1,000 - 100,000)

Further MD

• Use of cut-off radius to reduce work– 8 - 14 Å– Faraway charges ignored!

• 80-95 % work is non-bonded force computations

• Some simulations need faraway contributions

Scalability

• The Program should scale up to use a large number of processors. – But what does that mean?

• An individual simulation isn’t truly scalable

• Better definition of scalability:– If I double the number of processors, I should

be able to retain parallel efficiency by increasing the problem size

Isoefficiency

• Quantify scalability

• How much increase in problem size is needed to retain the same efficiency on a larger machine?

• Efficiency : Seq. Time/ (P · Parallel Time)– parallel time =

• computation + communication + idle

Traditional Approaches

• Replicated Data:– All atom coordinates stored on each processor– Non-bonded Forces distributed evenly– Analysis: Assume N atoms, P processors

• Computation: O(N/P)

• Communication: O(N log P)

• Communication/Computation ratio: P log P

• Fraction of communication increases with number of processors, independent of problem size!

Atom decomposition

• Partition the Atoms array across processors– Nearby atoms may not be on the same processor– Communication: O(N) per processor– Communication/Computation: O(P)

Force Decomposition

• Distribute force matrix to processors– Matrix is sparse, non uniform– Each processor has one block– Communication: N/sqrt(P)– Ratio: sqrt(P)

• Better scalability (can use 100+ processors)– Hwang, Saltz, et al: – 6% on 32 Pes 36% on 128 processor

Spatial Decomposition

• Allocate close-by atoms to the same processor

• Three variations possible:– Partitioning into P boxes, 1 per processor

• Good scalability, but hard to implement

– Partitioning into fixed size boxes, each a little larger than the cutoff disctance

– Partitioning into smaller boxes

• Communication: O(N/P)

Spatial Decomposition in NAMD

• NAMD 1 used spatial decomposition

• Good theoretical isoefficiency, but for a fixed size system, load balancing problems

• For midsize systems, got good speedups up to 16 processors….

• Use the symmetry of Newton’s 3rd law to facilitate load balancing

Spatial Decomposition

FD + SD

• Now, we have many more objects to load balance:– Each diamond can be assigned to any processor– Number of diamonds (3D): – 14·Number of Patches

Bond Forces

• Multiple types of forces:– Bonds(2), Angles(3), Dihedrals (4), ..– Luckily, each involves atoms in neighboring

patches only

• Straightforward implementation:– Send message to all neighbors,– receive forces from them– 26*2 messages per patch!

Bonded Forces:

• Assume one patch per processor

B

CA

Implementation

• Multiple Objects per processor– Different types: patches, pairwise forces,

bonded forces,– Each may have its data ready at different times– Need ability to map and remap them– Need prioritized scheduling

• Charm++ supports all of these

Charm++

• Data Driven Objects

• Object Groups: – global object with a “representative” on each PE

• Asynchronous method invocation

• Prioritized scheduling

• Mature, robust, portable

• http://charm.cs.uiuc.edu

Data driven execution

Scheduler Scheduler

Message Q Message Q

Load Balancing

• Is a major challenge for this application– especially for a large number of processors

• Unpredictable workloads– Each diamond (force object) and patch

encapsulate variable amount of work– Static estimates are inaccurate

• Measurement based Load Balancing– Very slow variations across timesteps

Bipartite graph balancing

• Background load:– Patches and angle forces

• Migratable load:– Non-bonded forces

• Bipartite communication graph – between migratable and non-migratable objects

• Challenge:– Balance Load while minimizing communication

Load balancing

• Collect timing data for several cycles

• Run heuristic load balancer– Several alternative ones

• Re-map and migrate objects accordingly– Registration mechanisms facilitate migration

• Needs a separate talk!

0

500000

1000000

1500000

2000000

2500000

3000000

3500000

4000000

4500000

5000000

Processors

Time

migratable work

non-migratable work

Before and After

Before and After

0

500000

1000000

1500000

2000000

2500000

3000000

3500000

4000000

4500000

Processors

Time migratable work

non-migratable work

0

500000

1000000

1500000

2000000

2500000

3000000

3500000

4000000

4500000

5000000

Processors

Time

migratable work

non-migratable work

0

500000

1000000

1500000

2000000

2500000

3000000

3500000

4000000

4500000

Processors

Time migratable work

non-migratable work

Performance: size of system

# ofatoms

Procs 1 2 4 8 16 32 64 128 160

bR Time 1.14 0.58 .315 .158 .086 .0483,762atoms

Speedup 1.0 1.97 3.61 7.20 13.2 23.7

ER-ERE Time 6.115 3.099 1.598 .810 .397 0.212 0.123 0.09836,573atoms

Speedup (1.97) 3.89 7.54 14.9 30.3 56.8 97.9 123

ApoA-I Time 10.76 5.46 2.85 1.47 0.729 0.382 0.32192,224atoms

Speedup (3.88) 7.64 14.7 28.4 57.3 109 130

Performance: various machines

Procs 1 2 4 8 16 32 64 128 160 192

T3E Time 6.12 3.10 1.60 0.810 0.397 0.212 0.123 0.098

- ---------

Speedup (1.97) 3.89 7.54 14.9 30.3 56.8 97.9 123

Origin Time 8.28 4.20 2.17 1.07 0.542 0.271 0.152

2000-------

Speedup 1.0 1.96 3.80 7.74 15.3 30.5 54.3

ASCI- Time 28.0 13.9 7.24 3.76 1.91 1.01 0.500 0.279 0.227 0.196

Red ---------

Speedup 1.0 2.01 3.87 7.45 14.7 27.9 56.0 100 123 143

NOWs Time 24.1 12.4 6.39 3.69

HP735/125

Speedup 1.0 1.94 3.77 6.54

Speedup

0

20

40

60

80

100

120

140

160

180

200

220

240

0 20 40 60 80 100 120 140 160 180 200 220 240

Processors

Speedup

Speedup

Perfect Speedup

Multi-paradigm programming

• Long-range electrostatic interactions– Some simulations require this– Contributions of faraway atoms can be

calculated infrequently– PVM based library, DPMTA

• developed at Duke by John Board et al

• Patch life cycle• Better expressed as a thread

Converse

• Supports multi-paradigm programming

• Provides portability

• Makes it easy to implement RTS for new paradigms

• Several languages/libraries:– Charm++, threaded MPI, PVM, Java, md-perl,

pc++, Nexus, Path, Cid, CC++, DP, Agents,..

Namd2 with Converse

NAMD2

• In production use – Internally for about a year– Several simulations completed/published

• Fastest MD program? We think so

• Modifiable/extensible– Steered MD– Free energy calculations

Lessons for CSE

• Technical lessons– Multiple-domain (patch) decomposition provides

necessary flexibility – Data driven objects and threads is a great combo– Measurement based load balancing is better– Multi-paradigm parallel programming works!

• Integrate independently developed libraries

• Use appropriate paradigm for each component

Real Application?

• Drawbacks– Need to spend effort on mundane details not

germane to CS research– Production program: complicates structure

Real Application for CS research?

• Benefits– Subtle and complex research problems uncovered

only with real application– Satisfaction of “real” concrete contribution– With careful planning, you can truly enrich the

“middle layers”– Bring back a rich variety of relevant CS problems– Apply to other domains: Rockets? Casting?

Collaboration lessons

• Use conservative methods..– C++: fashionable vs. conservative– Aggressive methods where they matter

• Account for differing priorities and objectives

Parallel Molecular Dynamics Application Oriented Computer Science Research Laxmikant Kale .

Documents

Transcript of Parallel Molecular Dynamics Application Oriented Computer Science Research Laxmikant Kale .