Performance Tuning of LAMMPS Dissipative Particle Dynamics Simulation on Intel MIC
Department of High Performance Computing, CNIC, CAS
Center of Scientific Computing Applications & Research, CAS
Shun Xu, Zhong Jin
2018.5.11
Intel® Parallel Computing Centers (IPCC) Asia Summit 2018
Outline
1. Introduction to Dissipative Particle Dynamics (DPD) simulations
2. Performance tuning of LAMMPS DPD
3. Conclusions
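Introduction to Dissipative Particle Dynamics (DPD)
Particles i and j interact within the cutoff range $r_c$:
Relative displacement: $\mathbf{r}_{ij} = \mathbf{r}_j - \mathbf{r}_i$
Relative velocity: $\mathbf{v}_{ij} = \mathbf{v}_j - \mathbf{v}_i$
Normalized $\mathbf{r}_{ij}$: $\hat{\mathbf{r}}_{ij} = \mathbf{r}_{ij} / r_{ij}$
Strength coefficient: $a_{ij}$
Equation of motion, with conservative, dissipative and random terms:
$$m_i \frac{d^2 \mathbf{r}_i}{dt^2} = \mathbf{F}_i = \sum_{j \neq i} \left[ \mathbf{f}^C(\mathbf{r}_{ij}) + \mathbf{f}^D(\mathbf{r}_{ij}, \mathbf{v}_{ij}) + \mathbf{f}^R(\mathbf{r}_{ij}) \right]$$
$$\mathbf{f}^C(\mathbf{r}_{ij}) = a_{ij} W_C(r_{ij}) \, \hat{\mathbf{r}}_{ij}$$
$$\mathbf{f}^D(\mathbf{r}_{ij}, \mathbf{v}_{ij}) = -\gamma_{ij} W_D(r_{ij}) \left( \hat{\mathbf{r}}_{ij} \cdot \mathbf{v}_{ij} \right) \hat{\mathbf{r}}_{ij}$$
$$\mathbf{f}^R(\mathbf{r}_{ij}) = \sigma_{ij} W_R(r_{ij}) \frac{\zeta_{ij}}{\sqrt{\delta t}} \, \hat{\mathbf{r}}_{ij}$$
$$W_D(r_{ij}) \equiv W_R^2(r_{ij}), \qquad W_R(r_{ij}) = W_C^s(r_{ij}) = \left( 1 - \frac{r_{ij}}{r_c} \right)^s$$
$$\sigma_{ij}^2 = 2 \gamma_{ij} k_B T, \qquad \zeta_{ij} = \zeta_{ji} \in N(0, 1)$$
[Figure labels: Repulsive F; Attractive F]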
The dissipative force and the random force are subject to constraints so that the system satisfies the Boltzmann weight distribution. The weight-function exponent s may take any value s >= 1.
Three pairwise forces act on each particle: the first is the conservative force, the second is the dissipative force, and the third is the random force.
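The three pairwise forces can be sketched in a few lines of C++. This is a minimal illustration, not the USER-INTEL implementation: the names `Vec3` and `dpd_pair_force` are hypothetical, the weight exponent is fixed at s = 1, and the displacement is taken as r_i - r_j (the sign convention under which a positive a_ij pushes particle i away from particle j), which is an assumption of this sketch.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };  // hypothetical helper type

// Force on particle i due to particle j: conservative + dissipative + random.
// zeta is a standard-normal random number shared by the pair (zeta_ij = zeta_ji).
inline Vec3 dpd_pair_force(const Vec3& ri, const Vec3& rj,
                           const Vec3& vi, const Vec3& vj,
                           double a, double gamma, double kBT,
                           double rc, double dt, double zeta) {
  const Vec3 d{ri.x - rj.x, ri.y - rj.y, ri.z - rj.z};   // assumed convention: r_i - r_j
  const double r = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
  if (r >= rc || r == 0.0) return {0.0, 0.0, 0.0};       // outside cutoff: no interaction

  const Vec3 dv{vi.x - vj.x, vi.y - vj.y, vi.z - vj.z};
  const double rinv = 1.0 / r;
  const double dot = (d.x * dv.x + d.y * dv.y + d.z * dv.z) * rinv;  // r_hat . v_ij
  const double w = 1.0 - r / rc;                         // W_R = W_C for s = 1
  const double sigma = std::sqrt(2.0 * gamma * kBT);     // sigma^2 = 2 gamma kB T

  const double fmag = a * w                              // conservative
                    - gamma * w * w * dot                // dissipative, W_D = W_R^2
                    + sigma * w * zeta / std::sqrt(dt);  // random
  return {fmag * d.x * rinv, fmag * d.y * rinv, fmag * d.z * rinv};
}
```

With both particles at rest, zeta = 0, and particle j placed half a cutoff away along +x, only the conservative term survives and the force on i points along -x, away from j.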
A case of LAMMPS DPD simulation of two-phase separation
N = 32000
Temp = 0.25
Density = 3
pair_coeff 1 1 25.0 2.5
pair_coeff 1 2 30.0 2.5
pair_coeff 2 2 25.0 2.5
Syntax: pair_coeff i j a_ij γ_ij [r_c]
(Run under LAMMPS without Intel MIC acceleration)
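As a quick worked check on these parameters, the noise amplitude follows from the relation σ² = 2γk_BT and the box edge from N and the density. Both lines assume reduced units with k_B = 1 and a cubic box; neither assumption is stated on the slide.

```latex
\sigma_{ij} = \sqrt{2 \gamma_{ij} k_B T} = \sqrt{2 \times 2.5 \times 0.25} = \sqrt{1.25} \approx 1.118
\qquad
L = \left( \frac{N}{\rho} \right)^{1/3} = \left( \frac{32000}{3} \right)^{1/3} \approx 22.0
```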
The scientific significance of accelerated DPD simulations
• To address the difficulty that DPD simulations in the soft matter field usually have in computing a large number of particles
• To observe (macroscopic) properties of a DPD system over a long time
• To connect DPD into multiscale simulations much more smoothly
A review paper in 2010 (E. Moeendarbary, T.Y. Ng, M. Zangeneh, "Dissipative Particle Dynamics in Soft Matter and Polymeric Applications - A Review", Int. J. Appl. Mech. 2 (2010) 161–190) mentioned that "The DPD is one of the most reliable mesoscopic simulation techniques for phenomenological investigation of soft matter and polymeric systems."
[1] P.J. Hoogerbrugge, J.M.V.A. Koelman, Europhys. Lett. 19 (3) (1992) 155–160.
[2] P. Español, P.B. Warren, Europhys. Lett. 30 (4) (1995) 191.
[3] R.D. Groot, P.B. Warren, J. Chem. Phys. 107 (11) (1997) 4423–4435.
2. Performance tuning of LAMMPS DPD
Intel Xeon Phi Accelerating MD
• LAMMPS: production-level MD software
• MiniMD: lightweight version for performance testing
SIMD optimized
1. LAMMPS Intel Xeon (Phi) USER-INTEL package (Intel KNC/KNL)
2. NAMD 2015-12-22 Linux-x86_64-multicore-MIC (Intel Xeon Phi coprocessor acceleration)
3. GROMACS 5.0-RC with Intel Xeon Phi coprocessor native/symmetric support (offload mode support planned)
Integrate DPD code into LAMMPS USER-INTEL package
About the USER-INTEL package
USER-INTEL is a LAMMPS plug-in package whose framework code is maintained by W. Michael Brown from Intel and Kurpad Anupama. It is mainly based on the previously developed USER-OMP plug-in package.
Main features:
• support for three kinds of precision: single, double and mixed
• vector optimization of key functions
• support for Intel Xeon Phi KNL and KNC in offload mode
Suffixes are applied in order: intel, omp
Turn on offload by defining the macro -DLMP_INTEL_OFFLOAD
Thread affinity setting is controlled by defining -DINTEL_OFFLOAD_NOAFFINITY
Compile: go to the LAMMPS src directory, then run make yes-USER-INTEL && make intel_phi
Several core code files:
fix_intel.cpp/.h: basic functions for MIC interactions
intel_buffers.cpp/.h: buffer management between host and MIC device (modified)
intel_intrinsics.h: routines for AVX-512 and AVX2
verlet_intel.cpp/.h: Verlet integration on Intel
pair_xxx_intel.cpp/.h: xxx potential on MIC (KNC or KNL)
LAMMPS simulation in KNC offload mode
[Figure: workflow between CPU and MIC, from input.in to output.log. Diagram labels: neighbor list; short-range terms; update F, v, x; MPI task rank=0 ... MPI task rank=n. Each subdomain maps to one MPI task, and each MPI task utilizes several MIC threads for offload tasks.]
Each CPU MPI task can launch several MIC threads for calculations.
LAMMPS in MPI-OpenMP vs. MIC-offload mode
OpenMP threads are partitioned between CPU and MIC devices:
1. Normal: MPI + host OpenMP threads
2. Advanced: MPI + host and MIC-offload OpenMP threads
[Figure: MPI task rank=0 ... MPI task rank=n, each running Thread 0 ... Thread m]
Initial setting in PairDPDIntel class
void PairDPDIntel::init_style()
{
  //…
  int ifix = modify->find_fix("package_intel");
  // pack the force constants into the buffers that match the precision mode
  if (fix->precision() == FixIntel::PREC_MODE_MIXED)
    pack_force_const(force_const_single, fix->get_mixed_buffers());
  else if (fix->precision() == FixIntel::PREC_MODE_DOUBLE)
    pack_force_const(force_const_double, fix->get_double_buffers());
  else
    pack_force_const(force_const_single, fix->get_single_buffers());
}
At the beginning of the calculation, PairDPDIntel calls init_style() to obtain the buffer object of type IntelBuffers<flt_t, acc_t>. A separate buffer is created for each precision mode.
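The idea of selecting a (flt_t, acc_t) pair at run time can be illustrated with a small stand-alone sketch. Everything named here (`PrecMode`, `sum_inverse`, `sum_inverse_dispatch`) is hypothetical and only mimics the pattern: single mode computes and accumulates in float, mixed mode computes in float but accumulates in double, and double mode uses double throughout.

```cpp
#include <vector>

enum class PrecMode { Single, Mixed, Double };  // hypothetical stand-in for the precision modes

// flt_t: type for per-pair arithmetic; acc_t: type for the running accumulator.
template <class flt_t, class acc_t>
double sum_inverse(const std::vector<double>& r) {
  acc_t acc = 0;
  for (double ri : r) {
    const flt_t inv = static_cast<flt_t>(1) / static_cast<flt_t>(ri);
    acc += static_cast<acc_t>(inv);
  }
  return static_cast<double>(acc);
}

// Run-time switch that picks the template instantiation once, outside the hot loop.
inline double sum_inverse_dispatch(PrecMode mode, const std::vector<double>& r) {
  switch (mode) {
    case PrecMode::Mixed:  return sum_inverse<float, double>(r);
    case PrecMode::Double: return sum_inverse<double, double>(r);
    default:               return sum_inverse<float, float>(r);
  }
}
```

Because the choice is made once per call, the inner loop contains no precision branches, which is the property a vectorizing compiler needs.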
PairDPDIntel::compute<flt_t,acc_t> function
if (eflag) {
  if (force->newton_pair) {
    eval<1, 1, 1>(1, ovflag, buffers, fc, 0, offload_end);    // for MIC offload
    eval<1, 1, 1>(0, ovflag, buffers, fc, host_start, inum);  // for host CPU
  } else {
    eval<1, 1, 0>(1, ovflag, buffers, fc, 0, offload_end);    // for MIC offload
    eval<1, 1, 0>(0, ovflag, buffers, fc, host_start, inum);  // for host CPU
  }
} else {
  if (force->newton_pair) {
    eval<1, 0, 1>(1, ovflag, buffers, fc, 0, offload_end);    // for MIC offload
    eval<1, 0, 1>(0, ovflag, buffers, fc, host_start, inum);  // for host CPU
  } else {
    eval<1, 0, 0>(1, ovflag, buffers, fc, 0, offload_end);    // for MIC offload
    eval<1, 0, 0>(0, ovflag, buffers, fc, host_start, inum);  // for host CPU
  }
}
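The nested if/else above turns run-time flags into compile-time template arguments, so each eval instantiation carries no flag checks in its inner loop. A reduced sketch of that pattern follows; `eval_range` and `compute_dispatch` are hypothetical names, and the half-weight rule for the non-Newton case is a toy stand-in rather than the actual energy tally.

```cpp
#include <vector>

// EFLAG and NEWTON_PAIR are compile-time constants, so the compiler can
// drop the untaken branches from each instantiation.
template <int EFLAG, int NEWTON_PAIR>
double eval_range(const std::vector<double>& pair_energy, int start, int end) {
  double total = 0.0;
  for (int i = start; i < end; ++i) {
    if (EFLAG) {
      // Toy rule: full weight when each pair is visited once,
      // half weight when it is visited from both sides.
      total += NEWTON_PAIR ? pair_energy[i] : 0.5 * pair_energy[i];
    }
  }
  return total;
}

// Run-time flags are mapped to template arguments exactly once.
inline double compute_dispatch(bool eflag, bool newton_pair,
                               const std::vector<double>& e, int start, int end) {
  if (eflag) {
    if (newton_pair) return eval_range<1, 1>(e, start, end);
    return eval_range<1, 0>(e, start, end);
  }
  if (newton_pair) return eval_range<0, 1>(e, start, end);
  return eval_range<0, 0>(e, start, end);
}
```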
Load balance setting:
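The slide's code for this step is not reproduced in the text. As a hedged illustration only, the split of the atom range into a coprocessor part [0, offload_end) and a host part [host_start, inum) could be computed from a single offload fraction. `RangeSplit` and `split_range` are hypothetical names, and the fixed fraction is an assumption; a real balancer may adjust it dynamically from timings.

```cpp
#include <algorithm>

struct RangeSplit {
  int offload_end;  // coprocessor handles atoms [0, offload_end)
  int host_start;   // host handles atoms [host_start, inum)
};

// Hypothetical helper: assign the first offload_fraction of inum atoms to the
// coprocessor and the rest to the host. The fraction is clamped to [0, 1].
inline RangeSplit split_range(int inum, double offload_fraction) {
  const double f = std::min(1.0, std::max(0.0, offload_fraction));
  const int offload_end = static_cast<int>(f * inum);
  return {offload_end, offload_end};
}
```

A fraction of 0 sends all work to the host, which corresponds to running without a coprocessor.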
Vectorization calculation
• Vector calculation on both sides
1. MIC thread: vector width of 512 bits
2. Host CPU: AVX vector width of 512 bits
• Data alignment optimization
1. Memory alignment of 64-bit variables
2. The atom data structure combines 4 floating-point numbers, so that its size divides evenly into the 512-bit vector width
3. Using advanced SIMD directives, such as #pragma simd reduction(...)
4. Using mixed precision (both single and double floats) to trade off between speed and accuracy
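The packing and alignment points can be checked at compile time. The sketch below is an assumption-laden illustration: `Atom4`, `atoms_per_vector` and `is_aligned` are hypothetical names, single precision is assumed, and 64 bytes is used as the size of one 512-bit vector.

```cpp
#include <cstddef>
#include <cstdint>

// Four floats per atom: 16 bytes, so four atoms fill one 512-bit (64-byte) vector.
struct alignas(16) Atom4 { float x, y, z, w; };

static_assert(sizeof(Atom4) == 16, "atom record must pack into 16 bytes");
static_assert(64 % sizeof(Atom4) == 0, "atom records must tile a 512-bit vector");

constexpr std::size_t kVectorBytes = 64;  // 512 bits

constexpr std::size_t atoms_per_vector() { return kVectorBytes / sizeof(Atom4); }

// True if pointer p sits on a multiple of `bytes`.
inline bool is_aligned(const void* p, std::size_t bytes) {
  return reinterpret_cast<std::uintptr_t>(p) % bytes == 0;
}
```

Declaring a buffer as `alignas(64) Atom4 buf[8];` then guarantees that every fourth atom starts on a vector boundary.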
PairDPDIntel::eval()
Highlight: random number usage in the DPD potential calculation
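The code shown on this slide did not survive in the text. One common way to keep random numbers from blocking vectorization is to fill a buffer of Gaussian numbers for all neighbors of an atom before entering the inner loop. The sketch below illustrates that idea with the C++ standard library. It is an assumption about the approach, not the package's code, and `fill_gaussian` and `random_force_sum` are hypothetical names.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draw jnum standard-normal numbers up front, one per neighbor.
inline std::vector<double> fill_gaussian(std::mt19937_64& gen, std::size_t jnum) {
  std::normal_distribution<double> gauss(0.0, 1.0);
  std::vector<double> buf(jnum);
  for (double& z : buf) z = gauss(gen);
  return buf;
}

// Inner loop that only reads the pre-generated buffer: sums the random-force
// magnitudes sigma * W_R * zeta / sqrt(dt) over the neighbors.
inline double random_force_sum(const std::vector<double>& wr,
                               const std::vector<double>& zeta,
                               double sigma, double dtinvsqrt) {
  double s = 0.0;
  for (std::size_t j = 0; j < wr.size(); ++j)
    s += sigma * wr[j] * zeta[j] * dtinvsqrt;
  return s;
}
```

Separating generation from consumption leaves a loop body with plain arithmetic on arrays, which a vectorizing compiler can handle.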
PairDPDIntel::eval()
Highlight: SIMD reduction in the DPD potential calculation
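The slide's listing is not available in the text. A minimal sketch of a SIMD reduction over an atom's neighbor list is given here instead, using the standard OpenMP `simd` directive in place of the Intel-specific pragma (an assumption; the directive is simply ignored when OpenMP is disabled). `Force3` and `accumulate_force` are hypothetical names.

```cpp
#include <cstddef>

struct Force3 { double fx, fy, fz; };

// Sum the per-neighbor force contributions on one atom. The reduction clause
// lets the compiler keep partial sums in vector lanes and combine them at the end.
inline Force3 accumulate_force(const double* dx, const double* dy, const double* dz,
                               const double* fpair, std::size_t jnum) {
  double fx = 0.0, fy = 0.0, fz = 0.0;
  #pragma omp simd reduction(+ : fx, fy, fz)
  for (std::size_t j = 0; j < jnum; ++j) {
    fx += dx[j] * fpair[j];
    fy += dy[j] * fpair[j];
    fz += dz[j] * fpair[j];
  }
  return {fx, fy, fz};
}
```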
Integration of source codes for KNC and KNL
pair_dpd_offload_intel.cpp, pair_dpd_offload_intel.h
pair_dpd_intel.cpp, pair_dpd_intel.h
• Use the macro variable LMP_INTEL_OFFLOAD to separate KNC and KNL/CPU codes
• Use thread parallelism
Intel KNL for test
Intel® Xeon Phi™ Processor 7210
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 4
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 87
Model name: Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
Stepping: 1
CPU MHz: 1182.289
BogoMIPS: 2599.92
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
NUMA node0 CPU(s): 0-255
All 16GB MCDRAM used as cache memory
CC = mpiicpc

KNL:
OPTFLAGS = -O2 -fp-model fast=2 -no-prec-div -qoverride-limits
CCFLAGS = -qopenmp -qno-offload -fno-alias -ansi-alias -restrict \
          -DLMP_INTEL_USELRT $(OPTFLAGS)

KNL_AVX512:
OPTFLAGS = -xMIC-AVX512 -O2 -fp-model fast=2 -no-prec-div -qoverride-limits
CCFLAGS = -qopenmp -qno-offload -fno-alias -ansi-alias -restrict \
          -DLMP_INTEL_USELRT $(OPTFLAGS)

KNL_AVX512_MKL:
OPTFLAGS = -xMIC-AVX512 -O2 -fp-model fast=2 -no-prec-div -qoverride-limits
CCFLAGS = -qopenmp -qno-offload -fno-alias -ansi-alias -restrict \
          -DLMP_INTEL_USELRT -DLMP_USE_MKL_RNG $(OPTFLAGS)
Run for 1, 2, 4 threads per core:
export OMP_NUM_THREADS=$threads
mpirun -np 64 lmp_intel_knl -in in.intel.dpd -log dpd.64c4t.log \
       -pk intel 0 -sf intel -screen none -v d 1
[Figure: LAMMPS DPD on Intel KNL, 512000 atoms * 4000 steps. Bar chart of timesteps/sec (y axis, 0 to 90) versus N threads/core (1, 2, 4, on 64 cores) for the KNL, KNL_AVX512 and KNL_AVX512_MKL builds. Annotated speedups: 1X, 1.61X, 2.89X.]
Intel® Trace Analyzer for MPI behavior
[Figure: traces for the KNL and KNL_AVX512_MKL builds]
Conclusions
• The optimization of LAMMPS DPD for the Intel platform is highlighted.
• The aim is to promote the applications of LAMMPS DPD.
Acknowledgements
• The LAMMPS DPD module inside the USER-INTEL package was initially developed in the CAS-IPCC project.
• Thanks to W. Michael Brown from Intel for his support.
Thank you for your attention!