Optimizing Quantum Chemistry using Charm++
Eric Bohm, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

1  Optimizing Quantum Chemistry using Charm++
   Eric Bohm, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

2  Overview
- CPMD: 9 phases
- Charm++ applicability: overlap, decomposition, portability, communication optimization
- Decomposition: state planes, 3-D FFT, 3-D matrix multiply
- Utilizing Charm++: prioritized non-local computation, Commlib, Projections

3  Quantum Chemistry: the LeanCP Collaboration
- Glenn Martyna (IBM TJ Watson), Mark Tuckerman (NYU), Nick Nystrom (PSU)
- PPL: Kale, Shi, Bohm, Pauli, Kumar (now at IBM), Vadali
- CPMD method: plane-wave QM for hundreds of atoms
- Charm++ parallelization; PINY MD physics engine

4  CPMD on Charm++
- 11 Charm++ arrays, 4 Charm++ modules, 13 Charm++ groups, 3 Commlib strategies
- BLAS, FFTW, PINY MD
- Adaptive overlap; prioritized computation for a phased application
- Communication optimization, load balancing, group caches, Rth threads

5  Practical Scaling
- Single-wall carbon nanotube field-effect transistor
- BG/L performance

6  Computation Flow

7  Charm++
- Uses the approach of virtualization: divide the work into VPs, typically many more than the number of processors, and schedule each VP for execution
- Advantage: computation and communication can be overlapped (between VPs)
- The number of VPs can be independent of the number of processors
- Other benefits: load balancing, checkpointing, etc.
- (A minimal chare-array sketch of this idea follows slide 17 below.)

8  Decomposition
- A higher degree of virtualization is better for Charm++
- Real-space state planes, G-space state planes, Rho real-space and Rho G-space, and S-Calculators for each G-space state plane
- Tens of thousands of chares for a 32-molecule problem
- Careful scheduling to maximize efficiency
- Most of the computation is in FFTs and matrix multiplies

9  3-D FFT Implementation
- Sparse 3-D FFT
- Dense 3-D FFT

10  Parallel FFT Library
- Slab-based parallelization
- We do not re-implement the sequential routine; we utilize the 1-D and 2-D FFT routines provided by FFTW
- Allows multiple 3-D FFTs simultaneously: multiple data sets within the same set of slab objects
- Useful because 3-D FFTs are used frequently in CP computations
- (See the FFTW slab sketch after slide 17.)

11  Multiple Parallel 3-D FFTs

12  Matrix Multiply
- Also known as the S-Calculator or Pair Calculator
- Decompose state-plane values into smaller objects
- Use DGEMM on the smaller sub-matrices (see the DGEMM sketch after slide 17)
- Sum the partial results via a reduction back to G-space

13  Matrix Multiply: VP-Based Approach

14  Charm++ Tricks and Tips
- Message-driven execution and a high degree of virtualization present tuning challenges
- Flow of control using Rth threads
- Prioritized messages
- The Commlib framework
- Charm++ arrays vs. groups
- Problem identification with Projections
- Problem isolation techniques

15  Flow Control in Parallel: Rth Threads
- Based on Duff's device, these are user-level threads with negligible overhead
- Essentially goto and return without the loss of readability
- Allow an event-loop style of programming and make the flow of control explicit
- Use a familiar threading semantic
- (See the Duff's-device sketch after slide 17.)

16  Rth Threads for Flow Control

17  Prioritized Messages for Overlap
- (A priority sketch follows below.)
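Slide 7 describes virtualization: divide the work into many more virtual processors (chares) than physical processors and let the runtime schedule whichever chare has a message ready. The following is a minimal, hypothetical Charm++ sketch of that idea; the module, class, and entry names are invented for illustration and this is not the LeanCP code.

```cpp
// vdemo.ci -- Charm++ interface file (assumed names)
//   mainmodule vdemo {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg *m);
//       entry [reductiontarget] void done();
//     };
//     array [1D] Piece {
//       entry Piece();
//       entry void work();
//     };
//   };

// vdemo.C -- over-decompose into ~16 chares ("VPs") per processor; the runtime
// schedules whichever chare has work, so one chare's communication overlaps
// with another chare's computation.
#include "vdemo.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
 public:
  Main(CkArgMsg *m) {
    delete m;
    mainProxy = thisProxy;
    const int nPieces = 16 * CkNumPes();          // #VPs independent of #procs
    CProxy_Piece pieces = CProxy_Piece::ckNew(nPieces);
    pieces.work();                                // broadcast to every element
  }
  void done() { CkExit(); }                       // all pieces have reported in
};

class Piece : public CBase_Piece {
 public:
  Piece() {}
  Piece(CkMigrateMessage *) {}
  void work() {
    // ... compute the slice of the problem owned by element thisIndex ...
    contribute(CkCallback(CkReductionTarget(Main, done), mainProxy));
  }
};

#include "vdemo.def.h"
```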
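Slide 10's parallel FFT library keeps FFTW for the sequential work: each slab object applies FFTW's 2-D routine to the planes it owns, and after the parallel transpose a batched 1-D plan finishes the transform along the remaining axis. The sketch below shows only that local FFTW usage, with illustrative class and member names; it is not the library's actual interface, and the all-to-all transpose between the two stages is omitted.

```cpp
// Slab-based 3-D FFT: the two local FFTW stages owned by one slab object.
#include <fftw3.h>

struct SlabFFT {
  int nx, ny, nz;        // global grid dimensions
  int nslab;             // number of planes owned by this slab object
  fftw_complex *planes;  // nslab contiguous nx*ny planes
  fftw_plan plan2d;      // batched 2-D FFT over the local planes

  SlabFFT(int nx_, int ny_, int nz_, int nslab_)
      : nx(nx_), ny(ny_), nz(nz_), nslab(nslab_) {
    planes = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) *
                                         (size_t)nslab * nx * ny);
    int n2[2] = {nx, ny};
    // One plan transforms every local plane (howmany = nslab), in place.
    plan2d = fftw_plan_many_dft(2, n2, nslab,
                                planes, NULL, 1, nx * ny,
                                planes, NULL, 1, nx * ny,
                                FFTW_FORWARD, FFTW_ESTIMATE);
  }

  void runPlaneStage() { fftw_execute(plan2d); }

  // After the parallel transpose this object holds `npencil` full z-lines;
  // a second batched plan completes the 3-D transform along z.
  static fftw_plan makePencilPlan(fftw_complex *pencils, int nz, int npencil) {
    int n1[1] = {nz};
    return fftw_plan_many_dft(1, n1, npencil,
                              pencils, NULL, 1, nz,
                              pencils, NULL, 1, nz,
                              FFTW_FORWARD, FFTW_ESTIMATE);
  }

  ~SlabFFT() {
    fftw_destroy_plan(plan2d);
    fftw_free(planes);
  }
};
```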
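Slide 12's Pair Calculator forms its contribution to the state-overlap matrix with a single DGEMM on the sub-matrices it owns, and the partial tiles are then summed by a reduction back to G-space. Below is a minimal sketch of one such tile, assuming a real-valued, row-major layout and invented names; the actual code works on plane-wave coefficients and uses a Charm++ reduction for the sum.

```cpp
#include <cblas.h>

// A: (nA x ng) block of state coefficients for this chare's slice of g-space
// B: (nB x ng) block for the paired set of states, same g slice
// S: (nA x nB) partial overlap tile, S(i,j) = sum_g A(i,g) * B(j,g)
// All matrices are row-major.
void pairCalcTile(const double *A, const double *B, double *S,
                  int nA, int nB, int ng) {
  // One DGEMM computes the whole tile: S = A * B^T.
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
              nA, nB, ng,
              1.0, A, ng,
                   B, ng,
              0.0, S, nB);

  // In the application the tile would now be contributed to a sum reduction,
  // e.g. contribute(nA * nB * sizeof(double), S, CkReduction::sum_double, cb),
  // so the slices from every plane chare are accumulated back in G-space.
}
```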
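Slide 15's Rth threads give an event-loop style of control flow using Duff's device: the switch statement re-enters a function at the point where it last suspended, so a message-driven routine can be written as straight-line phases. The following is a generic protothread-style sketch of the mechanism, not the actual Rth implementation.

```cpp
#include <cstdio>

// A resumable function: the switch jumps back to the case label recorded at
// the last suspension, so each call continues where the previous one stopped.
#define RTH_BEGIN()   switch (rthLine) { case 0:
#define RTH_SUSPEND() do { rthLine = __LINE__; return; case __LINE__:; } while (0)
#define RTH_END()     } rthLine = 0

struct GSpacePhase {
  int rthLine = 0;   // where to resume on the next call

  // Called once per incoming event; runs until the next RTH_SUSPEND().
  void driver() {
    RTH_BEGIN();
    std::printf("phase 1: start the forward FFT\n");
    RTH_SUSPEND();                      // wait for the FFT results to arrive
    std::printf("phase 2: launch the pair calculators\n");
    RTH_SUSPEND();                      // wait for the reduction
    std::printf("phase 3: integrate and finish the step\n");
    RTH_END();
  }
};

int main() {
  GSpacePhase g;
  g.driver();   // prints phase 1, then suspends
  g.driver();   // resumes at phase 2
  g.driver();   // resumes at phase 3
}
```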
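Slide 17 overlaps the less urgent non-local work with the phased main path by giving its messages a lower priority. In Charm++ an integer priority can be attached to an entry-method invocation through CkEntryOptions (smaller values are served first). The fragment below is a sketch with invented proxy and entry names; it assumes the usual charmxx-generated proxies from a .ci file like the one in the comment.

```cpp
// In the .ci file (assumed):
//   array [1D] GSpacePlane { entry void doFFT(); };
//   array [1D] NonLocal    { entry void computeNonLocal(); };
#include "charm++.h"

// Send the phase-critical G-space work with an urgent priority and the
// non-local work with a lazier one; the scheduler then lets the non-local
// computation fill idle time without delaying the current phase.
void launchPlane(CProxy_GSpacePlane gProxy, CProxy_NonLocal nlProxy, int plane) {
  CkEntryOptions urgent;
  urgent.setPriority(-100);                 // most urgent work in this phase
  gProxy[plane].doFFT(&urgent);

  CkEntryOptions lazy;
  lazy.setPriority(100);                    // runs when nothing urgent is queued
  nlProxy[plane].computeNonLocal(&lazy);
}
```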
18  Communication Library
- Fine-grained decomposition can result in many small messages
- Message combining via the Commlib framework in Charm++ addresses this problem
- The streaming protocol optimizes many-to-many personalized communication
- Forwarding protocols such as Ring or Multiring can be beneficial, but not on BG/L

19  Commlib Strategy Selection

20  Streaming Commlib
- Saves time: 610 ms vs. 480 ms

21  Bound Arrays
- Why? Efficiency and clarity of expression
- Two arrays of the same dimensionality whose like indices are co-placed
- G-space and the non-local computation both perform plane-based computations and share many data elements
- Use ckLocal() to access elements, allowing ordinary local function calls
- The bound arrays remain distinct parallel objects

22  Group Caching Techniques
- Group objects have one element per processor, making them excellent cache points for arrays that may have many chares per processor
- Place low-volatility data in the group; array elements access it through the local branch (see the group-cache sketch at the end of this transcript)
- In CPMD, all chares that work on plane P share the same copy of the Structure Factor

23  Charm++ Performance Debugging
- Complex parallel applications are hard to debug
- An event-based model with a high degree of virtualization presents new challenges
- Tools: Projections and the Charm++ debugger
- Bottleneck identification using the Projections Usage Profile tool

24  Old S->T Orthonormalization

25  After Parallel S->T

26  Problem Isolation Techniques
- Using Rth threads it is easy to isolate phases by adding a barrier: contribute to a reduction and suspend; the reduction's broadcast client resumes the elements
- In the following example we break the G-space IFFT into computation and communication entry methods, then insert a barrier between them to highlight a specific performance problem (see the barrier sketch at the end of this transcript)

27  Projections Timeline Analysis

28  Optimizations Motivated by BG/L
- Finer decomposition: the Structure Factor and non-local computation now operate on groups of atoms within a plane, improving scaling
- Avoid creating network bottlenecks: there is no DMA or communication offload on BG/L's torus network
- Workarounds for the MPI progress engine: set eager, complex double-pack optimization
- A new FFT-based algorithm for the Structure Factor
- More systems: topology-aware chare mapping, HLL orchestration expression

31  What Time Is It in Scotland?
- There is a 1024-node BG/L in Edinburgh; the time there is 6 hours ahead of CT
- During this non-production time we can run on the full rack at night
- Thank you EPCC!
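Slide 22 keeps low-volatility data such as the Structure Factor in a Charm++ group, which has exactly one branch per processor, so every chare on that processor reads the same copy through the local branch. A minimal sketch with invented class and method names; this is not the application's code.

```cpp
// In the .ci file (assumed):
//   group SFCache { entry SFCache(); };
#include "sfcache.decl.h"   // generated from the .ci above
#include <map>
#include <vector>

class SFCache : public CBase_SFCache {     // exactly one branch per processor
  std::map<int, std::vector<double> > sfByPlane;   // plane index -> structure factor
 public:
  SFCache() {}

  // Every chare on this processor that works on `plane` shares this one copy,
  // so the data is computed and stored once per processor, not once per chare.
  const std::vector<double> &get(int plane) {
    std::vector<double> &sf = sfByPlane[plane];
    if (sf.empty()) sf = computeStructureFactor(plane);
    return sf;
  }

 private:
  std::vector<double> computeStructureFactor(int plane) {
    (void)plane;
    return std::vector<double>(128, 0.0);   // placeholder payload
  }
};

// From an array element, with a readonly CProxy_SFCache sfCacheProxy set up at
// startup, the cached data is reached by an ordinary local call, no message:
//   const std::vector<double> &sf = sfCacheProxy.ckLocalBranch()->get(myPlane);

#include "sfcache.def.h"
```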
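Slide 26 isolates the computation and communication halves of the G-space IFFT by placing a barrier between them: every element contributes to an empty reduction and effectively suspends, and the reduction's broadcast client resumes the whole array, so the two halves separate cleanly in the Projections timeline. A sketch with invented names, assuming the standard reduction-target idiom.

```cpp
// In the .ci file (assumed):
//   array [1D] GPlane {
//     entry GPlane();
//     entry void ifftCompute();
//     entry [reductiontarget] void ifftComm();
//   };
#include "gplane.decl.h"   // generated from the .ci above

class GPlane : public CBase_GPlane {
 public:
  GPlane() {}
  GPlane(CkMigrateMessage *) {}

  void ifftCompute() {
    // ... local IFFT computation for this plane ...
    // Barrier: contribute to an empty reduction; this element does nothing
    // further until the reduction completes ("suspend").
    contribute(CkCallback(CkReductionTarget(GPlane, ifftComm), thisProxy));
  }

  // The reduction's broadcast client: once every element has contributed, the
  // runtime broadcasts this entry to the whole array ("resume"), so the
  // communication phase starts at a single, visible point in the timeline.
  void ifftComm() {
    // ... send the IFFT transpose messages for this plane ...
  }
};

#include "gplane.def.h"
```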