Performance Tuning of LAMMPS Dissipative Particle Dynamics Simulation on Intel MIC

Department of High Performance Computing, CNIC, CAS
Center of Scientific Computing Applications & Research, CAS

Shun Xu, Zhong Jin

2018.5.11

Intel® Parallel Computing Centers (IPCC) Asia Summit 2018

Outline

1. Introduction to Dissipative Particle Dynamics (DPD) simulations

2. Performance tuning of LAMMPS DPD

3. Conclusions

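Introduction to Dissipative Particle Dynamics (DPD)

Particles i and j interact when they are within the cutoff range $r_c$. The equation of motion of particle i is

$$ m_i \frac{d^2 \mathbf{r}_i}{dt^2} = \mathbf{F}_i = \sum_{j \neq i} \left[ \mathbf{f}^C(\mathbf{r}_{ij}) + \mathbf{f}^D(\mathbf{r}_{ij}, \mathbf{v}_{ij}) + \mathbf{f}^R(\mathbf{r}_{ij}) \right] $$

Three pairwise forces: the first is the conservative force, the second is the dissipative force, the third is the random force.

Relative displacement: $\mathbf{r}_{ij} = \mathbf{r}_j - \mathbf{r}_i$
Relative velocity: $\mathbf{v}_{ij} = \mathbf{v}_j - \mathbf{v}_i$
Normalized $\mathbf{r}_{ij}$: $\hat{\mathbf{r}}_{ij} = \mathbf{r}_{ij} / r_{ij}$
Strength coefficient: $a_{ij}$

$$ \mathbf{f}^C(\mathbf{r}_{ij}) = a_{ij} W_C(r_{ij}) \, \hat{\mathbf{r}}_{ij} $$
$$ \mathbf{f}^D(\mathbf{r}_{ij}, \mathbf{v}_{ij}) = -\gamma_{ij} W_D(r_{ij}) \, (\hat{\mathbf{r}}_{ij} \cdot \mathbf{v}_{ij}) \, \hat{\mathbf{r}}_{ij} $$
$$ \mathbf{f}^R(\mathbf{r}_{ij}) = \sigma_{ij} W_R(r_{ij}) \, \zeta_{ij} \, \delta t^{-1/2} \, \hat{\mathbf{r}}_{ij} $$

Weight functions and coefficients:

$$ W_D(r_{ij}) \equiv W_R^2(r_{ij}), \qquad W_R(r_{ij}) = W_C^s(r_{ij}) = \left( 1 - \frac{r_{ij}}{r_c} \right)^s $$
$$ \sigma_{ij}^2 = 2 \gamma_{ij} k_B T, \qquad \zeta_{ij} = \zeta_{ji} \in N(0, 1) $$

The dissipative and random forces are constrained by these relations so that the system satisfies the Boltzmann weight distribution, for any $s \ge 1$.

[Figure labels: Repulsive F, Attractive F]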
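The three force terms can be collected into a single scalar prefactor along the unit vector between the two particles. The following is a minimal C++ sketch of that prefactor, not the LAMMPS implementation. The names `DpdParams` and `dpd_pair_force` are hypothetical, and the sketch assumes the usual discretization in which the random term scales as the inverse square root of the time step.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical parameter bundle for one pair of particle types.
struct DpdParams {
  double a;      // conservative strength a_ij
  double gamma;  // dissipative coefficient gamma_ij
  double kBT;    // thermal energy k_B * T
  double rc;     // cutoff radius r_c
  double s;      // exponent of the weight function
};

// Scalar DPD pair force along the unit vector r_hat_ij.
//   r     : distance |r_ij|
//   rdotv : projection r_hat_ij . v_ij
//   zeta  : N(0,1) random number shared by the pair (zeta_ij == zeta_ji)
//   dt    : time step
double dpd_pair_force(const DpdParams& p, double r, double rdotv,
                      double zeta, double dt) {
  assert(p.rc > 0.0 && dt > 0.0);
  if (r >= p.rc) return 0.0;                  // no interaction beyond the cutoff
  const double wc = 1.0 - r / p.rc;           // W_C
  const double wr = std::pow(wc, p.s);        // W_R = W_C^s
  const double wd = wr * wr;                  // W_D = W_R^2
  const double sigma = std::sqrt(2.0 * p.gamma * p.kBT);  // sigma^2 = 2 gamma kB T
  const double fc = p.a * wc;                             // conservative part
  const double fd = -p.gamma * wd * rdotv;                // dissipative part
  const double fr = sigma * wr * zeta / std::sqrt(dt);    // random part
  return fc + fd + fr;
}
```

Because the projection of the relative velocity on the unit vector does not change when both the displacement and the velocity difference flip sign, the scalar returned here is independent of the sign convention chosen for the pair.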

A LAMMPS DPD simulation of two-phase separation (run under LAMMPS without Intel MIC acceleration)

N = 32000, Temp = 0.25, Density = 3
    pair_coeff 1 1 25.0 2.5
    pair_coeff 1 2 30.0 2.5
    pair_coeff 2 2 25.0 2.5
Syntax: pair_coeff i j aij γij [rc]

The scientific significance of accelerated DPD simulations

• To make soft-matter DPD simulations with a large number of particles, which are usually difficult to compute, tractable
• To observe (macroscopic) properties of a DPD system over a long time
• To connect DPD to multiscale simulations more smoothly

A 2010 review paper (E. Moeendarbary, T.Y. Ng, M. Zangeneh, "Dissipative Particle Dynamics in Soft Matter and Polymeric Applications - A Review", Int. J. Appl. Mech. 2 (2010) 161–190) states that "The DPD is one of the most reliable mesoscopic simulation techniques for phenomenological investigation of soft matter and polymeric systems."

[1] P.J. Hoogerbrugge, J.M.V.A. Koelman, Europhys. Lett. 19 (3) (1992) 155–160.
[2] P. Español, P.B. Warren, Europhys. Lett. 30 (4) (1995) 191.
[3] R.D. Groot, P.B. Warren, J. Chem. Phys. 107 (11) (1997) 4423–4435.


Intel Xeon Phi Accelerating MD

• LAMMPS: product-level MD software
• MiniMD: lightweight version for performance testing

SIMD optimized:
1. LAMMPS: Intel Xeon (Phi) USER-INTEL package (Intel KNC/KNL)
2. NAMD: 2015-12-22 Linux-x86_64-multicore-MIC (Intel Xeon Phi coprocessor acceleration)
3. GROMACS: 5.0-RC with Intel Xeon Phi coprocessor native/symmetric support (offload mode support is planned)

https://software.intel.com/en-us/articles/lammps-for-intel-xeon-phi-coprocessor

https://software.intel.com/en-us/articles/gromacs-for-intel-xeon-phi-coprocessor?language=fr

Integrate DPD code into LAMMPS USER-INTEL package

About the USER-INTEL package

USER-INTEL is a LAMMPS plug-in package whose framework code is maintained by W. Michael Brown from Intel and Kurpad Anupama. It is mainly based on the previously developed USER-OMP plug-in package.

Main features:
• support for three precision modes: single, double and mixed
• vector optimization of key functions
• support for Intel Xeon Phi KNL and KNC in offload mode

Suffix order: intel, omp
Offload is turned on by defining the macro -DLMP_INTEL_OFFLOAD
Thread affinity setting for offload is controlled by the macro -DINTEL_OFFLOAD_NOAFFINITY

Compile: in the LAMMPS src directory, run make yes-USER-INTEL && make intel_phi

Several core code files:
• fix_intel.cpp/.h: basic functions for MIC interactions
• intel_buffers.cpp/.h: buffer management between the host and the MIC device (modified)
• intel_intrinsics.h: routines for AVX-512 and AVX2
• verlet_intel.cpp/.h: Verlet integration on Intel
• pair_xxx_intel.cpp/.h: xxx potential on MIC (KNC or KNL)

LAMMPS simulation in KNC offload mode

[Figure: workflow diagram with the labels CPU, MIC, input.in, output.log, neighbor list, short-range terms, update F, v, x, MPI task rank=0 ... MPI task rank=n]

Each subdomain maps to an MPI task.
Each CPU MPI task can launch several MIC threads for its offloaded calculations.

LAMMPS in MPI-OpenMP vs. MIC-offload mode

1. Normal: MPI + host OpenMP threads
2. Advanced: MPI + host and MIC offload OpenMP threads, with the OpenMP threads partitioned between CPU and MIC devices

[Figure: MPI tasks rank=0 ... rank=n, each with threads Thread 0 ... Thread m]

Initial setting in PairDPDIntel class

    void PairDPDIntel::init_style() {
      // …
      int ifix = modify->find_fix("package_intel");
      if (fix->precision() == FixIntel::PREC_MODE_MIXED)
        pack_force_const(force_const_single, fix->get_mixed_buffers());
      else if (fix->precision() == FixIntel::PREC_MODE_DOUBLE)
        pack_force_const(force_const_double, fix->get_double_buffers());
      else
        pack_force_const(force_const_single, fix->get_single_buffers());
    }

At the beginning of the calculation, PairDPDIntel calls init_style() to obtain its IntelBuffers<flt_t, acc_t> buffer variable. A different buffer is created for each precision mode.
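The role of the two template parameters can be illustrated outside LAMMPS. In the sketch below, `flt_t` is the type used for per-element arithmetic and `acc_t` the type used for accumulation; `PrecMode`, `accumulate_as` and `accumulate_dispatch` are hypothetical names chosen for this illustration and are not part of the USER-INTEL package.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the three precision modes.
enum class PrecMode { Single, Mixed, Double };

// Convert each value to flt_t, then accumulate in acc_t.
template <class flt_t, class acc_t>
double accumulate_as(const std::vector<double>& values) {
  acc_t acc = 0;
  for (double v : values) {
    const flt_t x = static_cast<flt_t>(v);  // working precision
    acc += static_cast<acc_t>(x);           // accumulation precision
  }
  return static_cast<double>(acc);
}

// Runtime mode selects one compile-time instantiation, in the same
// spirit as the if / else if / else chain in init_style().
double accumulate_dispatch(PrecMode mode, const std::vector<double>& values) {
  assert(!values.empty());
  switch (mode) {
    case PrecMode::Mixed:
      return accumulate_as<float, double>(values);
    case PrecMode::Double:
      return accumulate_as<double, double>(values);
    default:
      return accumulate_as<float, float>(values);
  }
}
```

With the values 1.0e8 and 1.0, the mixed and double modes return 100000001, while a pure single-precision accumulator would be expected to lose the small term. This is the trade-off that the mixed mode targets: single-precision arithmetic with a wider accumulator.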

PairDPDIntel::compute<flt_t, acc_t> function

    if (eflag) {
      if (force->newton_pair) {
        eval<1, 1, 1>(1, ovflag, buffers, fc, 0, offload_end);    // for MIC offload
        eval<1, 1, 1>(0, ovflag, buffers, fc, host_start, inum);  // for host CPU
      } else {
        eval<1, 1, 0>(1, ovflag, buffers, fc, 0, offload_end);    // for MIC offload
        eval<1, 1, 0>(0, ovflag, buffers, fc, host_start, inum);  // for host CPU
      }
    } else {
      if (force->newton_pair) {
        eval<1, 0, 1>(1, ovflag, buffers, fc, 0, offload_end);    // for MIC offload
        eval<1, 0, 1>(0, ovflag, buffers, fc, host_start, inum);  // for host CPU
      } else {
        eval<1, 0, 0>(1, ovflag, buffers, fc, 0, offload_end);    // for MIC offload
        eval<1, 0, 0>(0, ovflag, buffers, fc, host_start, inum);  // for host CPU
      }
    }

Load balance setting: the index ranges passed to eval() (0 to offload_end for the MIC, host_start to inum for the host) determine how the work is split.
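The pattern of turning runtime flags into template arguments, and of splitting the atom range in two, can be shown with a toy example. The sketch below is a serial 1-D stand-in with a soft repulsion, written under stated assumptions: `eval_range` and `compute_split` are hypothetical names, the constants are arbitrary, and both ranges run on the host, whereas the real code sends the first range to the coprocessor.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// EFLAG and NEWTON_PAIR are compile-time constants, so the compiler can
// drop the unused branches in each instantiation.
template <int EFLAG, int NEWTON_PAIR>
void eval_range(const std::vector<double>& x, std::vector<double>& f,
                double& energy, std::size_t begin, std::size_t end) {
  const double a = 25.0, rc = 1.0;  // arbitrary toy parameters
  for (std::size_t i = begin; i < end; ++i) {
    // With Newton's third law each pair is visited once (j > i).
    const std::size_t jstart = NEWTON_PAIR ? i + 1 : 0;
    for (std::size_t j = jstart; j < x.size(); ++j) {
      if (j == i) continue;
      const double d = x[i] - x[j];
      const double r = std::fabs(d);
      if (r >= rc || r == 0.0) continue;
      const double w = 1.0 - r / rc;
      const double fpair = a * w * (d / r);  // pushes i away from j
      f[i] += fpair;
      if (NEWTON_PAIR) f[j] -= fpair;        // equal and opposite force
      // Half the pair energy per visit when each pair is seen twice.
      if (EFLAG) energy += (NEWTON_PAIR ? 1.0 : 0.5) * 0.5 * a * rc * w * w;
    }
  }
}

// Runtime flags choose the instantiation; `split` plays the role of the
// boundary between the two index ranges.
void compute_split(bool eflag, bool newton_pair, const std::vector<double>& x,
                   std::vector<double>& f, double& energy, std::size_t split) {
  const std::size_t n = x.size();
  assert(split <= n && f.size() == n);
  if (eflag) {
    if (newton_pair) {
      eval_range<1, 1>(x, f, energy, 0, split);
      eval_range<1, 1>(x, f, energy, split, n);
    } else {
      eval_range<1, 0>(x, f, energy, 0, split);
      eval_range<1, 0>(x, f, energy, split, n);
    }
  } else {
    if (newton_pair) {
      eval_range<0, 1>(x, f, energy, 0, split);
      eval_range<0, 1>(x, f, energy, split, n);
    } else {
      eval_range<0, 0>(x, f, energy, 0, split);
      eval_range<0, 0>(x, f, energy, split, n);
    }
  }
}
```

In this toy version the result does not depend on where the split is placed; moving it only changes how much work each range receives, which is what makes it usable as a load-balance parameter.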

Vectorization calculation

• Vector calculation on both sides:
1. MIC thread: vector width of 512 bits
2. Host CPU: AVX vector width of 512 bits

• Data alignment and related optimizations:
1. 64-bit alignment of variables in memory
2. The atom data structure combines 4 floating-point numbers, so that atoms pack evenly into 512-bit vectors
3. Use of advanced SIMD directives, such as #pragma simd reduction(...)
4. Use of mixed precision (both single and double floats) as a trade-off between speed and accuracy
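The points above can be combined in one small loop. The sketch below is illustrative only: `AtomF` and `sum_weights` are hypothetical names, and it uses the portable OpenMP spelling `#pragma omp simd` instead of the Intel-specific directive named on the slide. Per-neighbor arithmetic is done in single precision and the reduction variable is a double, which mirrors the mixed-precision idea.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical atom record: four floats, 16 bytes, so four records
// fill one 64-byte (512-bit) vector register.
struct alignas(16) AtomF {
  float x, y, z, w;
};
static_assert(sizeof(AtomF) == 16, "atom record should be 16 bytes");
static_assert(64 % sizeof(AtomF) == 0, "records should tile a 512-bit vector");

// Sum of the linear weights (1 - r/rc) of all atoms within the cutoff
// of `center`. The directive asks the compiler to vectorize the loop and
// to combine the per-lane partial sums into `sum`.
double sum_weights(const AtomF* atoms, int n, const AtomF& center, float rc) {
  assert(rc > 0.0f);
  double sum = 0.0;
#pragma omp simd reduction(+ : sum)
  for (int k = 0; k < n; ++k) {
    const float dx = atoms[k].x - center.x;
    const float dy = atoms[k].y - center.y;
    const float dz = atoms[k].z - center.z;
    const float r = std::sqrt(dx * dx + dy * dy + dz * dz);
    const float wgt = (r < rc) ? 1.0f - r / rc : 0.0f;  // zero beyond cutoff
    sum += wgt;  // single-precision term, double-precision accumulator
  }
  return sum;
}
```

If the compiler is invoked without OpenMP SIMD support, the pragma is ignored and the loop still produces the same sum, only without the vectorization hint.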

PairDPDIntel::eval(): highlight of the random number usage in the DPD potential calculation

PairDPDIntel::eval(): highlight of the SIMD reduction in the DPD potential calculation

Integration of source codes for KNC and KNL

pair_dpd_offload_intel.cpp, pair_dpd_offload_intel.h
pair_dpd_intel.cpp, pair_dpd_intel.h

• Use the macro LMP_INTEL_OFFLOAD to separate the KNC code from the KNL/CPU code
• Use thread parallelism

Intel KNL for test

Intel® Xeon Phi™ Processor 7210; all 16GB MCDRAM used as cache memory

Architecture:           x86_64
CPU op-mode(s):         32-bit, 64-bit
Byte Order:             Little Endian
CPU(s):                 256
On-line CPU(s) list:    0-255
Thread(s) per core:     4
Core(s) per socket:     64
Socket(s):              1
NUMA node(s):           1
Vendor ID:              GenuineIntel
CPU family:             6
Model:                  87
Model name:             Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
Stepping:               1
CPU MHz:                1182.289
BogoMIPS:               2599.92
Virtualization:         VT-x
L1d cache:              32K
L1i cache:              32K
L2 cache:               1024K
NUMA node0 CPU(s):      0-255

Build settings for the three test configurations (CC = mpiicpc):

KNL:
    OPTFLAGS = -O2 -fp-model fast=2 -no-prec-div -qoverride-limits
    CCFLAGS = -qopenmp -qno-offload -fno-alias -ansi-alias -restrict \
              -DLMP_INTEL_USELRT $(OPTFLAGS)

KNL_AVX512:
    OPTFLAGS = -xMIC-AVX512 -O2 -fp-model fast=2 -no-prec-div -qoverride-limits
    CCFLAGS = -qopenmp -qno-offload -fno-alias -ansi-alias -restrict \
              -DLMP_INTEL_USELRT $(OPTFLAGS)

KNL_AVX512_MKL:
    OPTFLAGS = -xMIC-AVX512 -O2 -fp-model fast=2 -no-prec-div -qoverride-limits
    CCFLAGS = -qopenmp -qno-offload -fno-alias -ansi-alias -restrict \
              -DLMP_INTEL_USELRT -DLMP_USE_MKL_RNG $(OPTFLAGS)

Run with 1, 2 and 4 threads per core:

    export OMP_NUM_THREADS=$threads
    mpirun -np 64 lmp_intel_knl -in in.intel.dpd -log dpd.64c4t.log \
        -pk intel 0 -sf intel -screen none -v d 1

[Figure: bar chart "LAMMPS DPD on Intel KNL, 512000 atoms * 4000 steps". x-axis: (64 cores with) N threads/core, N = 1, 2, 4; y-axis: timesteps/sec. (0 to 90); series: KNL, KNL_AVX512, KNL_AVX512_MKL; annotated speedups: 1X, 1.61X, 2.89X]

[Figure: Intel® Trace Analyzer views of MPI behavior for KNL and KNL_AVX512_MKL]


Conclusions

• The LAMMPS DPD optimization for the Intel platform is highlighted.
• The aim is to promote the applications of LAMMPS DPD.

Acknowledgements

• The LAMMPS DPD module inside the USER-INTEL package was initially developed in the CAS-IPCC project.
• Thanks to W. Michael Brown from Intel for his support.

Thank you for your attention!
