Performance Tuning of LAMMPS Dissipative Particle Dynamics
Simulation on Intel MIC
Department of High Performance Computing, CNIC, CAS
Center of Scientific Computing Applications & Research, CAS
Shun Xu, Zhong Jin
2018.5.11
Intel® Parallel Computing Centers (IPCC) Asia Summit 2018
Outline
1. Introduction to Dissipative Particle Dynamics (DPD) simulations
2. Performance tuning of LAMMPS DPD
3. Conclusions
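Introduction to Dissipative Particle Dynamics (DPD)
Particles i and j interact within the cutoff range $r_c$:
Relative displacement: $\mathbf{r}_{ij} = \mathbf{r}_j - \mathbf{r}_i$
Relative velocity: $\mathbf{v}_{ij} = \mathbf{v}_j - \mathbf{v}_i$
Normalized $\mathbf{r}_{ij}$: $\hat{\mathbf{r}}_{ij} = \mathbf{r}_{ij} / r_{ij}$
Strength coefficient: $a_{ij}$
Equation of motion:
$$ m_i \frac{d^2 \mathbf{r}_i}{dt^2} = \mathbf{F}_i = \sum_{j \neq i} \left[ \mathbf{f}^C(r_{ij}) + \mathbf{f}^D(r_{ij}, v_{ij}) + \mathbf{f}^R(r_{ij}) \right] $$
Three pairwise forces: the first is the conservative force, the second is the dissipative force, and the third is the random force:
$$ \mathbf{f}^C(r_{ij}) = a_{ij} W_C(r_{ij}) \, \hat{\mathbf{r}}_{ij} $$
$$ \mathbf{f}^D(r_{ij}, v_{ij}) = -\gamma_{ij} W_D(r_{ij}) \, (\hat{\mathbf{r}}_{ij} \cdot \mathbf{v}_{ij}) \, \hat{\mathbf{r}}_{ij} $$
$$ \mathbf{f}^R(r_{ij}) = \sigma_{ij} W_R(r_{ij}) \, \zeta_{ij} \, \delta t^{-1/2} \, \hat{\mathbf{r}}_{ij} $$
Weight functions, for any $s \geq 1$:
$$ W_D(r_{ij}) \equiv W_R^2(r_{ij}), \qquad W_R(r_{ij}) = W_C^s(r_{ij}) = \left( 1 - \frac{r_{ij}}{r_c} \right)^s $$
The dissipative and random forces must satisfy the following constraints so that the system follows the Boltzmann weight distribution:
$$ \sigma_{ij}^2 = 2 \gamma_{ij} k_B T, \qquad \zeta_{ij} = \zeta_{ji} \in N(0, 1) $$
Figure labels: Repulsive F, Attractive F, Attractive F.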
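The three force terms can be sketched as one function for a single pair. This is a minimal illustration, not the LAMMPS implementation: the names `Vec3`, `DpdParams` and `dpd_pair_force` are hypothetical, the random term is assumed to scale as $\delta t^{-1/2}$ (the Groot-Warren form), and the Gaussian sample `zeta` is passed in by the caller. The displacement convention $\mathbf{r}_{ij} = \mathbf{r}_j - \mathbf{r}_i$ follows the slide as transcribed; LAMMPS itself uses $\mathbf{r}_i - \mathbf{r}_j$.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

// Hypothetical parameter bundle for one pair type (not a LAMMPS type).
struct DpdParams {
  double a;      // a_ij, conservative strength coefficient
  double gamma;  // gamma_ij, dissipative coefficient
  double kBT;    // k_B * T
  double rc;     // cutoff radius r_c
  double s;      // exponent of the weight function, s >= 1
};

// Returns f^C + f^D + f^R for one pair, using the slide's conventions
// r_ij = r_j - r_i and v_ij = v_j - v_i.
// zeta is a N(0,1) sample supplied by the caller; dt is the time step.
// Assumption: the random term is scaled by dt^(-1/2).
Vec3 dpd_pair_force(const Vec3& ri, const Vec3& rj,
                    const Vec3& vi, const Vec3& vj,
                    const DpdParams& p, double zeta, double dt) {
  assert(p.rc > 0.0 && dt > 0.0);
  Vec3 rij{}, vij{};
  double r2 = 0.0;
  for (int k = 0; k < 3; ++k) {
    rij[k] = rj[k] - ri[k];
    vij[k] = vj[k] - vi[k];
    r2 += rij[k] * rij[k];
  }
  const double r = std::sqrt(r2);
  Vec3 f{};                                  // zero outside the cutoff
  if (r == 0.0 || r >= p.rc) return f;

  const double wc = 1.0 - r / p.rc;          // W_C
  const double wr = std::pow(wc, p.s);       // W_R = W_C^s
  const double wd = wr * wr;                 // W_D = W_R^2
  const double sigma = std::sqrt(2.0 * p.gamma * p.kBT);  // sigma^2 = 2 gamma kB T

  double rdotv = 0.0;                        // r_hat . v_ij
  for (int k = 0; k < 3; ++k) rdotv += (rij[k] / r) * vij[k];

  const double magnitude = p.a * wc                       // conservative
                         - p.gamma * wd * rdotv           // dissipative
                         + sigma * wr * zeta / std::sqrt(dt);  // random
  for (int k = 0; k < 3; ++k) f[k] = magnitude * rij[k] / r;
  return f;
}
```

Passing `zeta = 0` switches off the random term, which makes the conservative and dissipative parts easy to check by hand.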
A case of LAMMPS DPD simulation of two-phase separation
N=32000
Temp=0.25
Density=3
pair_coeff 1 1 25.0 2.5
pair_coeff 1 2 30.0 2.5
pair_coeff 2 2 25.0 2.5
(Under LAMMPS without Intel MIC acceleration)
Syntax: pair_coeff i j aij γij [rc]
The scientific significance of accelerated DPD simulations
• To address the difficulty that DPD simulations of soft matter usually have with large numbers of particles
• To observe (macroscopic) properties of a DPD system over long times
• To connect DPD into multiscale simulations more smoothly
A 2010 review paper (E. Moeendarbary, T.Y. Ng, M. Zangeneh, "Dissipative Particle Dynamics in Soft Matter and Polymeric Applications - A Review", Int. J. Appl. Mech. 2 (2010) 161–190) states that "The DPD is one of the most reliable mesoscopic simulation techniques for phenomenological investigation of soft matter and polymeric systems."
[1] P.J. Hoogerbrugge, J.M.V.A. Koelman, Europhys. Lett. 19 (3) (1992) 155–160.
[2] P. Español, P.B. Warren, Europhys. Lett. 30 (4) (1995) 191.
[3] R.D. Groot, P.B. Warren, J. Chem. Phys. 107 (11) (1997) 4423–4435.
Intel Xeon Phi Accelerating MD
• LAMMPS: production-level MD software
• MiniMD: lightweight version for performance testing
SIMD optimized
1. LAMMPS Intel Xeon (Phi) USER-INTEL package (Intel KNC/KNL)
2. NAMD 2015-12-22 Linux-x86_64-multicore-MIC (Intel Xeon Phi coprocessor acceleration)
3. GROMACS 5.0-RC with Intel Xeon Phi coprocessor native/symmetric support (support for offload mode is planned)
Integrate DPD code into LAMMPS USER-INTEL package
About the USER-INTEL package
USER-INTEL is a LAMMPS plug-in package whose framework code is maintained by W. Michael Brown from Intel and Kurpad Anupama. It is mainly based on the previously developed USER-OMP plug-in package.
Main features:
• support for three kinds of precision: single, double and mixed
• vector optimization of key functions
• support for Intel Xeon Phi KNL and KNC in offload mode
Suffixes are applied in order: intel, omp
Turn on offload by defining the macro variable -DLMP_INTEL_OFFLOAD
Thread affinity setting is controlled by defining -DINTEL_OFFLOAD_NOAFFINITY
Compile: in the LAMMPS src directory, run make yes-USER-INTEL && make intel_phi
Several core code files
fix_intel.cpp/.h: basic functions for MIC interactions
intel_buffers.cpp/.h: buffer management between host and MIC device (modified)
intel_intrinsics.h: routines for AVX-512 and AVX2
verlet_intel.cpp/.h: Verlet integration on Intel
pair_xxx_intel.cpp/.h: xxx potential on MIC (KNC or KNL)
LAMMPS simulation in KNC offload mode
[Diagram labels: CPU; MIC; input.in, output.log; neighbor list, short-range terms; update F, v, x; MPI task rank=0 ... MPI task rank=n; each subdomain maps to an MPI task; an MPI task utilizes several MIC threads for offload tasks.]
Each CPU MPI task can launch several MIC threads for calculations.
LAMMPS in MPI-OpenMP vs. MIC-offload mode
1. Normal: MPI + host OpenMP threads
2. Advanced: MPI + host & MIC offload OpenMP threads
OpenMP threads are partitioned between CPU and MIC devices.
[Diagram labels: MPI task rank=0 ... MPI task rank=n; Thread 0 ... Thread m.]
Initial setting in PairDPDIntel class
void PairDPDIntel::init_style()
{
  //…
  int ifix = modify->find_fix("package_intel");
  // Pack the force constants into buffers of the selected precision.
  if (fix->precision() == FixIntel::PREC_MODE_MIXED)
    pack_force_const(force_const_single, fix->get_mixed_buffers());
  else if (fix->precision() == FixIntel::PREC_MODE_DOUBLE)
    pack_force_const(force_const_double, fix->get_double_buffers());
  else
    pack_force_const(force_const_single, fix->get_single_buffers());
}
At the beginning of the calculation, PairDPDIntel calls init_style() to obtain the buffer variable of type IntelBuffers<flt_t, acc_t>. A buffer is created for each of the different precisions.
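The precision selection above can be illustrated with a small template sketch. It assumes that `flt_t` is the type in which pair terms are held and `acc_t` is the type in which they are accumulated, so that mixed precision pairs `float` with `double`. The names `PrecMode`, `sum_pair_terms` and `sum_with_mode` are hypothetical and are not part of USER-INTEL.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the three precision modes.
enum class PrecMode { Single, Mixed, Double };

// flt_t: type used to hold each pair term; acc_t: type used to accumulate.
template <typename flt_t, typename acc_t>
acc_t sum_pair_terms(const std::vector<double>& terms) {
  acc_t total = 0;
  for (double t : terms) {
    const flt_t reduced = static_cast<flt_t>(t);  // term stored in flt_t
    total += static_cast<acc_t>(reduced);         // accumulated in acc_t
  }
  return total;
}

// Runtime mode selects one compile-time instantiation, mirroring the
// if / else-if / else chain in init_style().
double sum_with_mode(PrecMode mode, const std::vector<double>& terms) {
  switch (mode) {
    case PrecMode::Mixed:
      return sum_pair_terms<float, double>(terms);
    case PrecMode::Double:
      return sum_pair_terms<double, double>(terms);
    default:  // PrecMode::Single
      return sum_pair_terms<float, float>(terms);
  }
}
```

Choosing the types once at the dispatch point keeps the inner loop free of precision checks, which is the usual reason for this kind of template design.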
PairDPDIntel::compute<flt_t,acc_t> function
// In each pair of calls, the first (first argument 1) is for MIC offload
// and the second (first argument 0) is for the host CPU.
if (eflag) {
  if (force->newton_pair) {
    eval<1, 1, 1>(1, ovflag, buffers, fc, 0, offload_end);
    eval<1, 1, 1>(0, ovflag, buffers, fc, host_start, inum);
  } else {
    eval<1, 1, 0>(1, ovflag, buffers, fc, 0, offload_end);
    eval<1, 1, 0>(0, ovflag, buffers, fc, host_start, inum);
  }
} else {
  if (force->newton_pair) {
    eval<1, 0, 1>(1, ovflag, buffers, fc, 0, offload_end);
    eval<1, 0, 1>(0, ovflag, buffers, fc, host_start, inum);
  } else {
    eval<1, 0, 0>(1, ovflag, buffers, fc, 0, offload_end);
    eval<1, 0, 0>(0, ovflag, buffers, fc, host_start, inum);
  }
}
Load balance setting:
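As a rough illustration of how a balance fraction could translate into the two index ranges passed to eval(), the sketch below splits `inum` atoms into an offload range [0, offload_end) and a host range [host_start, inum). This is an assumption made for illustration only: the type `WorkSplit`, the function `split_work` and the rule that the host range starts where the offload range ends are hypothetical, and the actual USER-INTEL logic may differ.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type: atoms [0, offload_end) go to the coprocessor,
// atoms [host_start, inum) stay on the host CPU.
struct WorkSplit {
  int offload_end;
  int host_start;
};

// fraction: share of the inum atoms assigned to the coprocessor, in [0, 1].
WorkSplit split_work(int inum, double fraction) {
  assert(inum >= 0 && fraction >= 0.0 && fraction <= 1.0);
  const int offload_end = static_cast<int>(std::floor(fraction * inum));
  // Assumption: the host range begins exactly where the offload range ends.
  return WorkSplit{offload_end, offload_end};
}
```

A fraction of 0 keeps all work on the host and a fraction of 1 sends all of it to the coprocessor.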
Vectorization calculation
• Vector calculation on both sides
1. MIC thread: vector width of 512 bits
2. Host CPU: AVX vector width of 512 bits
• Data alignment optimization
1. 64-bit variable memory alignment
2. The atom data structure combines 4 floating-point numbers, so its size divides the 512-bit vector width evenly
3. Using advanced SIMD directives, such as #pragma SIMD reduction (.)
4. Using mixed precision (both single and double floats) as a trade-off between speed and accuracy
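The alignment and reduction points can be sketched together. The example assumes a hypothetical atom record `AtomXYZW` of four floats (16 bytes, so four records fill one 512-bit vector) and uses the OpenMP `#pragma omp simd reduction` directive as a portable counterpart of the SIMD reduction directive named on the slide. The function `sum_r2` is illustrative and is not taken from the DPD kernel. Without OpenMP enabled the pragma is ignored and the loop still gives the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical packed atom record: x, y, z plus one extra word.
// 4 floats = 16 bytes = 128 bits, so 4 records fit one 512-bit vector.
struct alignas(16) AtomXYZW {
  float x, y, z, w;
};
static_assert(sizeof(AtomXYZW) == 16, "atom record should be 16 bytes");

// Sum of squared distances from atom i to every atom in the list.
// Distances are computed in float and accumulated in double (mixed precision).
double sum_r2(const std::vector<AtomXYZW>& atoms, std::size_t i) {
  assert(i < atoms.size());
  const AtomXYZW* a = atoms.data();
  const std::size_t n = atoms.size();
  const float xi = a[i].x, yi = a[i].y, zi = a[i].z;
  double total = 0.0;
  // Each SIMD lane keeps a private partial sum; the partial sums are
  // combined into total when the loop finishes.
  #pragma omp simd reduction(+ : total)
  for (std::size_t j = 0; j < n; ++j) {
    const float dx = a[j].x - xi;
    const float dy = a[j].y - yi;
    const float dz = a[j].z - zi;
    total += static_cast<double>(dx * dx + dy * dy + dz * dz);
  }
  return total;
}
```

The reduction clause is what allows the compiler to vectorize a loop whose iterations all update the same accumulator.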
PairDPDIntel::eval()
Highlight: random number usage in the DPD potential calculation
PairDPDIntel::eval()
Highlight: SIMD reduction in the DPD potential calculation
Integration of source codes for KNC and KNL
pair_dpd_offload_intel.cpp, pair_dpd_offload_intel.h
pair_dpd_intel.cpp, pair_dpd_intel.h
• Use the macro variable LMP_INTEL_OFFLOAD to separate KNC and KNL/CPU codes
• Use thread parallelism
Intel KNL for test
Intel® Xeon Phi™ Processor 7210
Architecture          x86_64
CPU op-mode(s)        32-bit, 64-bit
Byte Order            Little Endian
CPU(s)                256
On-line CPU(s) list   0-255
Thread(s) per core    4
Core(s) per socket    64
Socket(s)             1
NUMA node(s)          1
Vendor ID             GenuineIntel
CPU family            6
Model                 87
Model name            Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
Stepping              1
CPU MHz               1182.289
BogoMIPS              2599.92
Virtualization        VT-x
L1d cache             32K
L1i cache             32K
L2 cache              1024K
NUMA node0 CPU(s)     0-255
All 16GB MCDRAM used as cache memory
CC = mpiicpc

KNL:
OPTFLAGS = -O2 -fp-model fast=2 -no-prec-div -qoverride-limits
CCFLAGS = -qopenmp -qno-offload -fno-alias -ansi-alias -restrict \
          -DLMP_INTEL_USELRT $(OPTFLAGS)

KNL_AVX512:
OPTFLAGS = -xMIC-AVX512 -O2 -fp-model fast=2 -no-prec-div -qoverride-limits
CCFLAGS = -qopenmp -qno-offload -fno-alias -ansi-alias -restrict \
          -DLMP_INTEL_USELRT $(OPTFLAGS)

KNL_AVX512_MKL:
OPTFLAGS = -xMIC-AVX512 -O2 -fp-model fast=2 -no-prec-div -qoverride-limits
CCFLAGS = -qopenmp -qno-offload -fno-alias -ansi-alias -restrict \
          -DLMP_INTEL_USELRT -DLMP_USE_MKL_RNG $(OPTFLAGS)
export OMP_NUM_THREADS=$threads
mpirun -np 64 lmp_intel_knl -in in.intel.dpd -log dpd.64c4t.log \
       -pk intel 0 -sf intel -screen none -v d 1
Run for 1, 2, 4 threads per core
[Figure: LAMMPS DPD on Intel KNL, 512000 atoms * 4000 steps. Bar chart of timesteps/sec. (axis from 0 to 90) against N threads/core (1, 2, 4, with 64 cores) for the KNL, KNL_AVX512 and KNL_AVX512_MKL builds. Speedup labels: 1X, 1.61X, 2.89X.]
Intel® Trace Analyzer for MPI behavior
[Figure: traces for the KNL and KNL_AVX512_MKL builds.]
Conclusions
• LAMMPS DPD optimization for the Intel platform is highlighted.
• The aim is to promote applications of LAMMPS DPD.
Acknowledgements
• The LAMMPS DPD module inside the USER-INTEL package was initially developed in the CAS-IPCC project.
• Thanks to W. Michael Brown from Intel for his support.
Thank you for your attention!