Achieving Strong Scaling on Blue Gene/L: Case Study with NAMD
Sameer Kumar, Blue Gene Software Group, IBM T J Watson Research Center, Yorktown Heights, NY. [email protected]
IBM Research, © 2005 IBM Corporation. Transcript of the 48-slide presentation.

Page 1:


Achieving Strong Scaling On Blue Gene/L: Case Study with NAMD

Sameer Kumar, Blue Gene Software Group, IBM T J Watson Research Center, Yorktown Heights, NY. [email protected]

Page 2:

Outline

Motivation

NAMD and Charm++

BGL Techniques

– Problem mapping

– Overlap of communication with computation

– Grain size

– Load-balancing

– Communication optimizations

Summary

Page 3:

Blue Gene/L

Page 4:

Blue Gene/L

Packaging hierarchy (peak performance, memory):

Chip: 2 processors; 2.8/5.6 GF/s, 4 MB
Compute Card: 2 chips (1x2x1); 5.6/11.2 GF/s, 1.0 GB
Node Card: 16 compute cards (32 chips, 4x4x2), 0-2 I/O cards; 90/180 GF/s, 16 GB
Rack: 32 node cards; 2.8/5.6 TF/s, 512 GB
System: 64 racks (64x32x32); 180/360 TF/s, 32 TB

Page 5:

Application Scaling

Weak

– Problem size increases with processors

Strong

– Constant problem size

– Linear to sub-linear decrease in computation time with processors

– Cache performance

– Communication overhead

• Communication-to-computation ratio

Page 6:

Scaling on Blue Gene/L

Several applications have demonstrated weak scaling

Strong scaling on a large number of benchmarks still needs to be achieved

Page 7:

NAMD and Charm++

Page 8:

NAMD: A Production MD program

NAMD

Fully featured program

NIH-funded development

Distributed free of charge (thousands of downloads so far)

Binaries and source code

Installed at NSF centers

User training and support

Large published simulations (e.g., aquaporin simulation featured in keynote)

Page 9:

Aquaporin Simulation

NAMD, CHARMM27, PME; NpT ensemble at 310 or 298 K; 1 ns equilibration, 4 ns production.

Protein: ~15,000 atoms
Lipids (POPE): ~40,000 atoms
Water: ~51,000 atoms
Total: ~106,000 atoms

3.5 days/ns on 128 O2000 CPUs
11 days/ns on 32 Linux CPUs
0.35 days/ns on 512 LeMieux CPUs

F. Zhu, E. Tajkhorshid, K. Schulten, FEBS Lett. 504, 212 (2001)
M. Jensen, E. Tajkhorshid, K. Schulten, Structure 9, 1083 (2001)

Page 10:

Molecular Dynamics in NAMD

Collection of [charged] atoms, with bonds

– Newtonian mechanics

– Thousands of atoms (10,000 - 500,000)

At each time-step

– Calculate forces on each atom

• Bonds
• Non-bonded: electrostatic and van der Waals

– Short-distance: every timestep
– Long-distance: using PME (3D FFT)
– Multiple time stepping: PME every 4 timesteps

– Calculate velocities and advance positions

Challenge: femtosecond time-step, millions needed!
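A minimal C++ sketch of the timestep structure described above, with illustrative names and stubbed force kernels (not NAMD's actual code): short-range forces every step, PME only every fourth step.

    #include <cstddef>
    #include <vector>

    struct Vec3 { double x = 0, y = 0, z = 0; };

    // Illustrative stand-ins for NAMD's force kernels (real kernels omitted).
    static void shortRangeForces(std::vector<Vec3>&) { /* bonded + cutoff pairs */ }
    static void longRangePME(std::vector<Vec3>&)     { /* 3D-FFT electrostatics */ }

    int main() {
        std::vector<Vec3> pos(1000), vel(1000), f(1000);
        const double dt = 1.0;                       // one femtosecond, in fs units
        for (long step = 0; step < 10000; ++step) {  // millions of steps in practice
            shortRangeForces(f);                     // every timestep
            if (step % 4 == 0) longRangePME(f);      // multiple time stepping
            for (std::size_t i = 0; i < pos.size(); ++i) {  // advance (unit masses)
                vel[i].x += dt * f[i].x; vel[i].y += dt * f[i].y; vel[i].z += dt * f[i].z;
                pos[i].x += dt * vel[i].x; pos[i].y += dt * vel[i].y; pos[i].z += dt * vel[i].z;
            }
        }
    }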

Page 11:

NAMD Benchmarks

BPTI: 3K atoms

Estrogen Receptor: 36K atoms (1996)

ATP Synthase: 327K atoms (2001)

Page 12:

Parallel MD: Easy or Hard?

Easy

– Tiny working data

– Spatial locality

– Uniform atom density

– Persistent repetition

– Multiple time-stepping

Hard

– Sequential timesteps

– Very short iteration time

– Full electrostatics

– Fixed problem size

– Dynamic variations

Page 13:

NAMD Computation

Application data divided into data objects called patches

– Sub-grids determined by cutoff

Computation performed by migratable computes

– 13 computes per patch pair and hence much more parallelism

– Computes can be further split to increase parallelism

Page 14:

NAMD

Scalable molecular dynamics simulation

2 types of objects: patches and computes, to expose more parallelism

Requires more careful load balancing

Page 15:

Communication to Computation Ratio

Scalable

– Constant with number of processors

– In practice grows at a very small rate

Page 16:

Charm++ and Converse

Charm++: object-based asynchronous message-driven parallel programming paradigm

Converse: communication layer for Charm++

– Send, recv, and progress at the node level

[Figure: user view vs. system implementation. In the user view, objects send messages to each other directly; in the implementation, a scheduler pulls from the receive message queue and dispatches to the target objects, while outgoing messages pass through the send message queue and interface to the network.]
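A minimal sketch of the message-driven scheduling loop the figure depicts, with illustrative types (this is not the actual Converse API):

    #include <queue>
    #include <vector>

    // Illustrative message and object types, not Converse's.
    struct Msg { int target; int payload; };
    struct Obj { void handle(const Msg&) { /* user method runs to completion */ } };

    int main() {
        std::vector<Obj> objs(4);
        std::queue<Msg> recvQ;                 // filled by the network layer
        recvQ.push({0, 42});
        recvQ.push({3, 7});
        while (!recvQ.empty()) {               // the scheduler loop
            Msg m = recvQ.front(); recvQ.pop();
            objs[m.target].handle(m);          // asynchronous, message-driven
        }
    }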

Page 17:

Optimizing NAMD on Blue Gene/L

Page 18:

Single Processor Performance

Worked with IBM Toronto for 3 weeks

– Inner loops slightly altered to enable software pipelining

– Aliasing issues resolved through the use of

#pragma disjoint (*ptr1, *ptr2)

– 40% serial speedup

– Current best performance is with the 440 (single-FPU) target

Continued efforts with Toronto to get good 440d performance
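For illustration, a hedged sketch of how the XL compilers' #pragma disjoint is applied; the names here are invented, not NAMD's inner loop:

    /* Asserting that the two pointers never alias frees the XL compiler to
       software-pipeline the loop. Illustrative names. */
    static double *force, *coord;
    #pragma disjoint(*force, *coord)

    void scale_add(int n, double s) {
        for (int i = 0; i < n; ++i)
            force[i] += s * coord[i];   /* loads and stores may now be reordered */
    }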

Page 19:

NAMD on BGL

Advantages

– Both application and hardware are 3D grids

– Large 4MB L3 cache

• On a large number of processors NAMD will run entirely from L3

– Higher bandwidth for short messages

• Midpoint of peak bandwidth achieved quickly

– Six outgoing links from each node

– No OS Daemons

Page 20:

NAMD on BGL

Disadvantages

– Slow embedded CPU

– Small memory per node

– Low bisection bandwidth

– Hard to scale full electrostatics

– Limited support for overlap of computation and communication

• No cache coherence

Page 21:

BGL Parallelization

Topology driven problem mapping

Load-balancing schemes

Overlap of computation and communication

Communication optimizations

Page 22:

Problem Mapping

[Figure: the application data space (X, Y, Z) mapped onto the processor grid (X, Y, Z).]

Page 23:

Problem Mapping

[Figure: application data space mapped onto the processor grid, continued.]

Page 24:

Problem Mapping

[Figure: application data space (X, Y, Z) mapped onto the processor grid with axes reordered (Y, X, Z).]

Page 25:

Problem Mapping

[Figure: data objects (patches) on the processor grid, with cutoff-driven compute objects placed between the patches they read from.]
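A minimal sketch of the kind of topology-driven placement these figures depict, assuming illustrative patch-grid and torus dimensions:

    #include <cstdio>

    struct Coord { int x, y, z; };

    // Scale a patch's grid index into torus coordinates so that patches near
    // each other in the simulation box land on nearby nodes. Dimensions are
    // illustrative.
    Coord mapPatch(Coord p, Coord grid, Coord torus) {
        return { p.x * torus.x / grid.x,
                 p.y * torus.y / grid.y,
                 p.z * torus.z / grid.z };
    }

    int main() {
        Coord grid  = {11, 11, 11};   // patch grid (illustrative)
        Coord torus = {8, 8, 16};     // 1024-node torus (illustrative)
        Coord n = mapPatch({5, 5, 5}, grid, torus);
        std::printf("patch (5,5,5) -> node (%d,%d,%d)\n", n.x, n.y, n.z);
    }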

Page 26:

Two Away Computation

Each data object (patch) is split along a dimension

– Patches now interact with neighbors of neighbors

– Makes application more fine grained

– Improves load balancing

– Messages of smaller size sent to more processors

– Improves torus bandwidth

Page 27:

Two Away X

Page 28:

Load Balancing Steps

Regular Timesteps

Instrumented Timesteps

Detailed, aggressive Load Balancing

Refinement Load Balancing

Page 29:

Load-balancing Metrics

Balancing load

Minimizing communication hop-bytes (defined in the sketch below)

– Place computes close to patches

– Biased through placement of proxies on near neighbors

Minimizing number of proxies

– Affects the connectivity of each data object
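A hedged sketch of the hop-bytes metric on a 3D torus, assuming illustrative message records (not the actual load balancer code):

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    struct Msg { int sx, sy, sz, dx, dy, dz; long bytes; };

    // Shortest hop count along one torus dimension (wraparound links).
    int hops(int a, int b, int dim) {
        int d = std::abs(a - b);
        return std::min(d, dim - d);
    }

    // Hop-bytes: message volume weighted by torus distance. The load balancer
    // places computes close to their patches to keep this small.
    long hopBytes(const std::vector<Msg>& msgs, int X, int Y, int Z) {
        long total = 0;
        for (const Msg& m : msgs)
            total += m.bytes * (hops(m.sx, m.dx, X) +
                                hops(m.sy, m.dy, Y) +
                                hops(m.sz, m.dz, Z));
        return total;
    }

    int main() {
        std::vector<Msg> msgs = {{0, 0, 0, 2, 3, 1, 1024}};
        return hopBytes(msgs, 8, 8, 16) == 1024 * 6 ? 0 : 1;  // 2+3+1 hops
    }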

Page 30:

Overlap of Computation and Communication

Each FIFO has 4 packet buffers

Progress engine should be called every 4400 cycles

Overhead of about 200 cycles

– Roughly a 5% increase in computation time (200 of every 4400 cycles)

Remaining time can be used for computation

Page 31:

Network Progress Calls

NAMD makes progress engine calls from the compute loops

– Typical period is 10,000 cycles, dynamically tunable

for (i = 0; i < (i_upper SELF(- 1)); ++i) {
    CmiNetworkProgress();
    const CompAtom &p_i = p_0[i];
    // ... compute pairlists ...
    for (k = 0; k < npairi; ++k) {
        // ... compute forces ...
    }
}

void CmiNetworkProgress() {
    new_time = rts_get_timebase();
    if (new_time < lastProgress + PERIOD)
        return;              // called again too soon: skip the advance
    lastProgress = new_time;
    AdvanceCommunication();  // pump the network FIFOs
}

Page 32:

MPI Scalability

Charm++ MPI Driver

– MPI_Iprobe based implementation (sketched below)

– Higher progress overhead of MPI_Test

– Statically pinned FIFOs for point-to-point communication
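A hedged sketch of an Iprobe-driven progress loop of the kind the Charm++ MPI driver uses; the buffer handling is illustrative, not the actual driver code:

    #include <mpi.h>
    #include <vector>

    // Poll for any incoming message; receive and hand off until none remain.
    void pumpMessages() {
        int flag = 0;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
        while (flag) {
            int count = 0;
            MPI_Get_count(&st, MPI_BYTE, &count);
            std::vector<char> buf(count);
            MPI_Recv(buf.data(), count, MPI_BYTE, st.MPI_SOURCE, st.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            // ... enqueue buf on the scheduler's receive queue ...
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
        }
    }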

Page 33:

Charm++ Native Driver

BGX Message Layer (developed by George Almasi)

– Lower progress overhead

– Active messages

• Easily design complex communication protocols

– Dynamic FIFO mapping

– Low overhead remote memory access

– Interrupts

– Charm++ BGX driver was developed by Chao Huang over this summer

Page 34:

BG/L Msglayer

[Figure: BGX message layer architecture. An advance loop posts messages from per-connection queues (scratchpad, torus, collective) into the network FIFOs; torus injection FIFOs (x+, x-, y+, y-, z+, z-, plus a high-priority FIFO) can be statically pinned or dynamically mapped; incoming torus, tree, and scratchpad packets, routed deterministically or dynamically, are dispatched through a packet registry; message templates such as TorusDirectMessage<> build on these packets.]

(This slide is taken from G. Almási's talk on the "new" msglayer.)

Page 35:

Optimized Multicast

pinFifo Algorithms

– Decide which of the six injection FIFOs to use when sending a message to {x,y,z,t}

– Cones, Chessboard

Dynamic FIFO mapping

– A special send queue from which a message can depart via whichever FIFO is not full (sketched below)
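A minimal sketch of the dynamic FIFO mapping idea, assuming an illustrative fifoFull() query (not the real msglayer API); the capacity of 4 packet buffers per FIFO comes from the earlier slide:

    #include <array>

    constexpr int NUM_FIFOS = 6;             // x+, x-, y+, y-, z+, z-
    constexpr int FIFO_CAPACITY = 4;         // packet buffers per FIFO
    std::array<int, NUM_FIFOS> fifoDepth{};  // illustrative occupancy counters

    bool fifoFull(int f) { return fifoDepth[f] >= FIFO_CAPACITY; }

    // Prefer the FIFO chosen by the pinFifo heuristic (e.g. cones or
    // chessboard); if it is full, fall back to any FIFO with space.
    int pickFifo(int preferred) {
        if (!fifoFull(preferred)) return preferred;
        for (int f = 0; f < NUM_FIFOS; ++f)
            if (!fifoFull(f)) return f;
        return preferred;                    // all full: caller must wait
    }

    int main() { return pickFifo(2) == 2 ? 0 : 1; }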

Page 36:

Communication Pattern in PME

[Figure: PME many-to-many communication pattern, 108 x 108 processors.]

Page 37:

PME

Plane decomposition for the 3D FFT (sketched below)

PME objects placed close to patch objects on the torus

PME optimized through an asynchronous all-to-all with dynamic FIFO mapping
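A hedged sketch of plane decomposition for the 3D FFT, assuming a grid of gridZ planes spread over P processors (illustrative, not NAMD's PME code):

    #include <cstdio>

    // With plane decomposition, each processor owns a contiguous slab of
    // Z-planes, performs 2D FFTs within its planes, then an all-to-all
    // transpose regroups the data for the 1D FFTs along Z.
    int planeOwner(int z, int gridZ, int P) {
        int planesPerProc = (gridZ + P - 1) / P;   // ceiling division
        return z / planesPerProc;
    }

    int main() {
        const int gridZ = 108, P = 108;   // matches the pattern shown earlier
        std::printf("plane 57 lives on processor %d\n", planeOwner(57, gridZ, P));
    }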

Page 38:

Performance Results

Page 39:

BGX Message layer vs MPI

NAMD Co-Processor Mode Performance (ms/step), APoA1 Benchmark

           Cutoff                with PME
# Nodes    Msglayer    MPI*      Msglayer    MPI*
4          2250        2250      -           -
32         314         316       356         371
128        85          91.6      103         -
512        22.7        23.8      26.7        27.8
1024       13.2        13.9      14.4        17.3
2048       7.9         8.1       9.7         10.2
4096       4.8         4.9       6.8         7.3

* The message layer uses sender-side blocking communication here.

Fully non-blocking version performed below par on MPI

– Polling overhead high for a list of posted receives

BGX message layer works well with asynchronous communication

Page 40:

Blocking vs Overlap

APoA1 Benchmark in Co-Processor Mode (ms/step)

           Cutoff                             with PME
# Nodes    Blocking    Sender Non-Blocking    Blocking    Sender Non-Blocking
32         314         313                    356         347
128        85          82                     103         97.2
512        22.7        21.7                   26.7        23.7
1024       13.2        11.9                   14.4        13.8
2048       7.9         7.3                    9.7         8.6
4096       4.8         4.3                    6.8         6.2
8192       -           3.7                    -           -

Page 41:

Effect of Network Progress

(Projections timeline of a 1024-node run without aggressive network progress)

Network progress not aggressive enough: communication gaps eat up utilization

Page 42:

Effect of Network Progress (2)

(Projections timeline of a 1024-node run with aggressive network progress)

More frequent advance closes gaps

Page 43:

Virtual Node Mode

[Chart: APoA1 step time (ms) with PME on 128 to 4096 processors, comparing co-processor (CP) and virtual node (VN) modes; y axis 0 to 120 ms.]

Page 44:

Spring vs Now

[Chart: APoA1 step time (ms) with PME on 32 to 4096 processors, comparing results from this spring with current results; logarithmic y axis, 1 to 1000 ms.]

Page 45:

Summary

Page 46:

Summary

Demonstrated good scaling to 4k processors on the APoA1 benchmark, with a speedup of 2100

– Still working on 8k results

ATPase scales well to 8k processors with a speedup of 4000+

Page 47:

Lessons Learnt

Eager messages lead to contention

Rendezvous protocols don't perform well for mid-size messages

Topology optimizations are a big winner

Overlap of computation and communication is possible

– Overlap, however, makes the compute load less predictable

The absence of operating-system daemons enables massive scaling

Page 48:

Future Plans

Experiment with new communication protocols

– Remote memory access

– Adaptive eager

– Fast asynchronous collectives

Improve load-balancing

– Newer distributed strategies

– Heavily loaded processors dynamically offload work to neighbors

Pencil decomposition for PME

Using the double hummer (440d dual FPU)