Performance Tuning of LAMMPS Dissipative Particle Dynamics Simulation on Intel MIC

Department of High Performance Computing, CNIC, CAS
Center of Scientific Computing Applications & Research, CAS

Shun Xu, Zhong Jin

2018.5.11

Intel® Parallel Computing Centers (IPCC) Asia Summit 2018

Outline

1. Introduction to Dissipative Particle Dynamics (DPD) simulations

2. Performance tuning of LAMMPS DPD

3. Conclusions

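Introduction to Dissipative Particle Dynamics (DPD)

Particles i and j interact within the cutoff range $r_c$. The equation of motion of particle i is

$$m_i \frac{d^2 \mathbf{r}_i}{dt^2} = \mathbf{F}_i = \sum_{j \neq i} \left[ \mathbf{f}^C(\mathbf{r}_{ij}) + \mathbf{f}^D(\mathbf{r}_{ij}, \mathbf{v}_{ij}) + \mathbf{f}^R(\mathbf{r}_{ij}) \right]$$

Three pairwise forces: the first is the conservative force, the second is the dissipative force, and the third is the random force:

$$\mathbf{f}^C(\mathbf{r}_{ij}) = a_{ij} W_C(r_{ij})\, \hat{\mathbf{r}}_{ij}$$

$$\mathbf{f}^D(\mathbf{r}_{ij}, \mathbf{v}_{ij}) = -\gamma_{ij} W_D(r_{ij}) \left( \hat{\mathbf{r}}_{ij} \cdot \mathbf{v}_{ij} \right) \hat{\mathbf{r}}_{ij}$$

$$\mathbf{f}^R(\mathbf{r}_{ij}) = \sigma_{ij} W_R(r_{ij}) \frac{\zeta_{ij}}{\sqrt{\delta t}}\, \hat{\mathbf{r}}_{ij}$$

• Relative displacement: $\mathbf{r}_{ij} = \mathbf{r}_j - \mathbf{r}_i$, with $r_{ij} = |\mathbf{r}_{ij}|$
• Relative velocity: $\mathbf{v}_{ij} = \mathbf{v}_j - \mathbf{v}_i$
• Normalized $\mathbf{r}_{ij}$: $\hat{\mathbf{r}}_{ij} = \mathbf{r}_{ij} / r_{ij}$
• Strength coefficients: $a_{ij}$ (conservative), $\gamma_{ij}$ (dissipative), $\sigma_{ij}$ (random)

The dissipative force and the random force are subject to constraints so that the system satisfies the Boltzmann weight distribution:

$$W_D(r_{ij}) \equiv W_R^2(r_{ij}), \qquad \sigma_{ij}^2 = 2 \gamma_{ij} k_B T, \qquad \zeta_{ij} = \zeta_{ji} \in N(0,1)$$

For any $s \ge 1$:

$$W_R(r_{ij}) = W_C^s(r_{ij}) = \left( 1 - \frac{r_{ij}}{r_c} \right)^s$$

[Figure labels: Repulsive F, Attractive F]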
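The three force terms above can be sketched in a few lines of C++. This is a minimal sketch under stated assumptions, not the LAMMPS implementation: the names `DpdParams`, `dpd_sigma` and `dpd_pair_scalar` are hypothetical, the exponent is assumed to be s = 1 (so W_C = W_R = 1 − r/r_c), and k_B T is given in reduced units. The function returns only the scalar coefficient that multiplies the unit vector r̂_ij.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical parameter bundle for one pair type (names are illustrative).
struct DpdParams {
  double a;      // conservative strength a_ij
  double gamma;  // dissipative strength gamma_ij
  double kBT;    // thermal energy k_B*T in reduced units
  double rc;     // cutoff radius r_c
};

// sigma_ij from the constraint sigma^2 = 2*gamma*k_B*T.
inline double dpd_sigma(const DpdParams &p) {
  return std::sqrt(2.0 * p.gamma * p.kBT);
}

// Scalar coefficient of r_hat_ij in f^C + f^D + f^R, assuming s = 1.
//   r      : pair distance |r_ij|
//   rhat_v : projection r_hat_ij . v_ij
//   zeta   : random number drawn from N(0,1) for this pair
//   dt     : time step
inline double dpd_pair_scalar(const DpdParams &p, double r, double rhat_v,
                              double zeta, double dt) {
  assert(r > 0.0 && dt > 0.0);
  if (r >= p.rc) return 0.0;         // outside the cutoff: no interaction
  const double wr = 1.0 - r / p.rc;  // W_R = W_C for s = 1
  const double wd = wr * wr;         // W_D = W_R^2
  const double fc = p.a * wr;                                   // conservative
  const double fd = -p.gamma * wd * rhat_v;                     // dissipative
  const double fr = dpd_sigma(p) * wr * zeta / std::sqrt(dt);   // random
  return fc + fd + fr;
}
```

Because the dissipative term depends on the projection of the relative velocity, two particles that approach each other (negative projection) receive an extra repulsive contribution, which is how the term damps relative motion.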

A case of LAMMPS DPD simulation of two-phase separation

N = 32000
Temp = 0.25
Density = 3
pair_coeff 1 1 25.0 2.5
pair_coeff 1 2 30.0 2.5
pair_coeff 2 2 25.0 2.5

(Under LAMMPS without Intel MIC acceleration)
Syntax: pair_coeff i j aij γij [rc]
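As a worked example for this case, the constraint σ² = 2γ k_B T fixes the random-force strength once γ and the temperature are chosen. Assuming reduced units with k_B = 1 (an assumption, not stated on the slide), γ = 2.5 and T = 0.25 give:

```latex
\sigma_{ij} = \sqrt{2\,\gamma_{ij}\,k_B T}
            = \sqrt{2 \times 2.5 \times 0.25}
            = \sqrt{1.25} \approx 1.118
```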

The scientific significance of accelerated DPD simulations

• To address the difficulty that DPD simulations in the soft matter field usually have with calculating a large number of particles

• To observe (macroscopic) properties of a DPD system over a long time

• To connect DPD to multiscale simulations more smoothly

A review paper in 2010:
E. Moeendarbary, T.Y. Ng, M. Zangeneh, DISSIPATIVE PARTICLE DYNAMICS IN SOFT MATTER AND POLYMERIC APPLICATIONS - A REVIEW, Int. J. Appl. Mech. 2 (2010) 161–190

mentioned that "The DPD is one [of] the most reliable mesoscopic simulation techniques for phenomenological investigation of soft matter and polymeric systems."

[1] P.J. Hoogerbrugge, J.M.V.A. Koelman, Europhys. Lett. 19 (3) (1992) 155–160.
[2] P. Español, P.B. Warren, Europhys. Lett. 30 (4) (1995) 191.
[3] R.D. Groot, P.B. Warren, J. Chem. Phys. 107 (11) (1997) 4423–4435.

Intel Xeon Phi Accelerating MD

• LAMMPS: product-level MD software

• MiniMD: lightweight version for performance testing

SIMD optimized:

1. LAMMPS Intel Xeon (Phi) USER-INTEL package (Intel KNC/KNL)
2. NAMD 2015-12-22 Linux-x86_64-multicore-MIC (Intel Xeon Phi coprocessor acceleration)
3. GROMACS 5.0-RC with Intel Xeon Phi coprocessor native/symmetric support (offload support planned)

Integrate DPD code into LAMMPS USER-INTEL package

About the USER-INTEL package

USER-INTEL is a LAMMPS plug-in package whose framework code is maintained by W. Michael Brown from Intel and Kurpad Anupama. It is mainly based on the previously developed USER-OMP plug-in package.

Main features:
• support for three kinds of precision: single, double and mixed
• vector optimization of key functions
• support for Intel Xeon Phi KNL, and KNC in offload mode

Suffixes are applied in order: intel, omp
Turn on offload by defining the macro -DLMP_INTEL_OFFLOAD
Thread affinity setting is controlled by the macro -DINTEL_OFFLOAD_NOAFFINITY

Compile: in the LAMMPS src directory, run make yes-USER-INTEL && make intel_phi

Several core code files:
• fix_intel.cpp/.h: basic functions for MIC interactions
• intel_buffers.cpp/.h: buffer management between host and MIC device (modified)
• intel_intrinsics.h: routines for AVX-512 and AVX2
• verlet_intel.cpp/.h: Verlet integration on Intel
• pair_xxx_intel.cpp/.h: xxx potential on MIC (KNC or KNL)

LAMMPS simulation in KNC offload mode

[Figure: workflow from input.in to output.log. MPI tasks rank=0 to rank=n each handle the neighbor list, the short-range terms, and the update of F, v, x.]

• Each subdomain maps to an MPI task

• Each CPU MPI task can launch several MIC threads for the calculations

LAMMPS in MPI-OpenMP vs. MIC-offload mode

1. Normal: MPI + host OpenMP threads

2. Advanced: MPI + host & MIC offload OpenMP threads, with the OpenMP threads partitioned between CPU and MIC devices

[Figure: MPI tasks rank=0 to rank=n, each running threads Thread 0 to Thread m]

Initial setting in PairDPDIntel class

    void PairDPDIntel::init_style()
    {
      // …
      int ifix = modify->find_fix("package_intel");
      // … (the slide omits the lines that set `fix` from the fix found above)

      // Pack the force constants into the buffer that matches the precision mode.
      if (fix->precision() == FixIntel::PREC_MODE_MIXED)
        pack_force_const(force_const_single, fix->get_mixed_buffers());
      else if (fix->precision() == FixIntel::PREC_MODE_DOUBLE)
        pack_force_const(force_const_double, fix->get_double_buffers());
      else
        pack_force_const(force_const_single, fix->get_single_buffers());
    }

At the beginning of the calculation, PairDPDIntel calls init_style() to obtain the IntelBuffers<flt_t, acc_t> buffer object. A different buffer is created for each precision mode.
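The three precision modes map naturally onto two template type parameters: one type for stored and computed values, one for accumulation. The following is a minimal sketch of that idea only; `accumulate_terms` and `sum_constant` are hypothetical helpers and are not part of the package.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// flt_t: type used to store and compute per-pair values.
// acc_t: type used to accumulate the running sum.
// single = <float, float>, mixed = <float, double>, double = <double, double>.
template <typename flt_t, typename acc_t>
acc_t accumulate_terms(const std::vector<flt_t> &terms) {
  acc_t sum = acc_t(0);
  for (std::size_t i = 0; i < terms.size(); ++i)
    sum += static_cast<acc_t>(terms[i]);  // widen each term before adding
  return sum;
}

// Hypothetical helper: sum n copies of the same value in a given mode.
template <typename flt_t, typename acc_t>
double sum_constant(std::size_t n, double value) {
  std::vector<flt_t> terms(n, static_cast<flt_t>(value));
  return static_cast<double>(accumulate_terms<flt_t, acc_t>(terms));
}
```

When many small terms are added, the <float, double> mode is expected to stay much closer to the full double result than <float, float>, while still storing the inputs in half the memory. That is the speed and accuracy trade-off the mixed mode aims for.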

PairDPDIntel::compute<flt_t, acc_t> function

    if (eflag) {
      if (force->newton_pair) {
        eval<1, 1, 1>(1, ovflag, buffers, fc, 0, offload_end);    // for MIC offload
        eval<1, 1, 1>(0, ovflag, buffers, fc, host_start, inum);  // for host CPU
      } else {
        eval<1, 1, 0>(1, ovflag, buffers, fc, 0, offload_end);
        eval<1, 1, 0>(0, ovflag, buffers, fc, host_start, inum);
      }
    } else {
      if (force->newton_pair) {
        eval<1, 0, 1>(1, ovflag, buffers, fc, 0, offload_end);
        eval<1, 0, 1>(0, ovflag, buffers, fc, host_start, inum);
      } else {
        eval<1, 0, 0>(1, ovflag, buffers, fc, 0, offload_end);
        eval<1, 0, 0>(0, ovflag, buffers, fc, host_start, inum);
      }
    }

In each pair of calls, the first eval handles the MIC offload part and the second handles the host CPU part.
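The nested branches turn runtime flags (eflag, newton_pair) into compile-time template arguments, so each eval instantiation can drop the tests it does not need from its inner loop. A minimal sketch of this dispatch pattern follows; `eval_sketch` and `dispatch` are hypothetical stand-ins that only report which variant was selected.

```cpp
#include <cassert>

// Hypothetical stand-in for a force kernel. It returns a code identifying
// the selected instantiation. In a real kernel, the template flags would let
// the compiler remove the energy and Newton branches from the inner loop.
template <int EFLAG, int NEWTON_PAIR>
int eval_sketch() {
  int code = 0;
  if (EFLAG) code += 2;        // resolved at compile time: energy tally on
  if (NEWTON_PAIR) code += 1;  // resolved at compile time: Newton's third law on
  return code;
}

// Map the runtime flags onto one of the four compiled variants.
inline int dispatch(bool eflag, bool newton_pair) {
  if (eflag) {
    return newton_pair ? eval_sketch<1, 1>() : eval_sketch<1, 0>();
  } else {
    return newton_pair ? eval_sketch<0, 1>() : eval_sketch<0, 0>();
  }
}
```

The cost of this design is code size: every flag combination is compiled separately. The benefit is that the hot loop contains no per-pair flag checks.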

Load balance setting:
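Each eval call works on an index range of atoms: [0, offload_end) for the coprocessor and [host_start, inum) for the host. How the package chooses these bounds is not shown here. The sketch below only illustrates one simple way a balance fraction could be turned into such a split; `SplitRange`, `split_range` and the `balance` parameter are hypothetical.

```cpp
#include <cassert>

// Hypothetical result type: atoms [0, offload_end) go to the coprocessor,
// atoms [host_start, inum) stay on the host.
struct SplitRange {
  int offload_end;
  int host_start;
};

// Split inum atoms according to the fraction of work sent to the coprocessor.
// balance is clamped to [0, 1]; a value of 0 keeps everything on the host.
inline SplitRange split_range(int inum, double balance) {
  assert(inum >= 0);
  if (balance < 0.0) balance = 0.0;
  if (balance > 1.0) balance = 1.0;
  SplitRange s;
  s.offload_end = static_cast<int>(balance * inum);
  s.host_start = s.offload_end;  // contiguous split with no overlap
  return s;
}
```

With balance = 0 the first range is empty and every atom is evaluated on the host, which corresponds to a run without a coprocessor.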

Vectorization calculation

• Vector calculation on both sides:
  1. MIC thread: vector width of 512 bits
  2. Host CPU: AVX vector width of 512 bits

• Data alignment optimization:
  1. 64-byte alignment of variable memory space
  2. The atom data structure combines 4 floating-point numbers, so that its size (16 bytes in single precision, 32 bytes in double precision) evenly divides the 512-bit vector width
  3. Using advanced SIMD directives, such as #pragma simd reduction(.)
  4. Using mixed precision (both single and double floats) as a trade-off between speed and accuracy
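The alignment points above can be checked at compile time. This is a minimal sketch, assuming the alignment target is one 512-bit vector (64 bytes); `AtomSketch`, `AtomBlock` and `is_aligned` are hypothetical names, not types from the package.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical atom record: x, y, z plus one extra slot, so that each
// record holds exactly 4 floating-point-sized fields.
template <typename flt_t>
struct AtomSketch {
  flt_t x, y, z;
  flt_t w;
};

// 4 floats = 16 bytes and 4 doubles = 32 bytes; both tile one 64-byte
// (512-bit) vector register without a remainder.
static_assert(sizeof(AtomSketch<float>) == 16, "4 floats expected");
static_assert(sizeof(AtomSketch<double>) == 32, "4 doubles expected");
static_assert(64 % sizeof(AtomSketch<float>) == 0, "must tile 64 bytes");
static_assert(64 % sizeof(AtomSketch<double>) == 0, "must tile 64 bytes");

// A fixed-size block of atoms aligned to a 64-byte boundary.
template <typename flt_t, std::size_t N>
struct alignas(64) AtomBlock {
  AtomSketch<flt_t> atoms[N];
};

// True if the pointer is aligned to the given boundary in bytes.
inline bool is_aligned(const void *p, std::size_t boundary) {
  return reinterpret_cast<std::uintptr_t>(p) % boundary == 0;
}
```

Padding the record to four fields means that a vector load never straddles two atoms, at the price of one unused or repurposed slot per atom.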

PairDPDIntel::eval()

Highlight of random number usage in the DPD potential calculation:
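Every interacting pair needs one random number ζ_ij drawn from N(0,1) at each step, and in a threaded kernel each thread needs its own stream so that threads do not share generator state. The sketch below illustrates that idea with the C++ standard library as a stand-in; it is not the generator used in the package, and `ThreadRandom` and `sample_mean` are hypothetical names. Seeding with base seed plus thread index is a simplification for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical per-thread random stream. Each thread owns one generator,
// seeded from a base seed plus the thread index, so no state is shared.
class ThreadRandom {
 public:
  ThreadRandom(unsigned base_seed, int thread_id)
      : engine_(base_seed + static_cast<unsigned>(thread_id)),
        gauss_(0.0, 1.0) {}

  // One zeta drawn from N(0,1) for a single i-j pair.
  double gaussian() { return gauss_(engine_); }

 private:
  std::mt19937 engine_;
  std::normal_distribution<double> gauss_;
};

// Draw n samples and return their mean (a sanity check on the stream).
inline double sample_mean(ThreadRandom &rng, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i) sum += rng.gaussian();
  return sum / n;
}
```

Two streams built with the same base seed and thread index reproduce the same sequence, which keeps runs repeatable for a fixed thread count.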

PairDPDIntel::eval()

Highlight of SIMD reduction in the DPD potential calculation:
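A SIMD reduction lets the compiler vectorize a loop that accumulates into a scalar by keeping one partial sum per vector lane. The sketch below uses the standard OpenMP form `#pragma omp simd reduction` in place of the Intel-specific `#pragma simd`; the function `sum_pair_forces` and its arguments are hypothetical and only mimic the shape of a neighbor loop.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Accumulate one force component of an atom over its neighbor list.
// The pragma declares fsum as a reduction variable, so the loop can be
// vectorized with per-lane partial sums that are combined at the end.
// Without OpenMP SIMD support the pragma is ignored and the loop runs as
// ordinary scalar code, with the same result up to rounding.
inline double sum_pair_forces(const double *fpair, const double *delx,
                              std::size_t jnum) {
  double fsum = 0.0;
#pragma omp simd reduction(+ : fsum)
  for (std::size_t jj = 0; jj < jnum; ++jj) {
    fsum += delx[jj] * fpair[jj];
  }
  return fsum;
}
```

Because the order of additions changes under vectorization, results can differ in the last bits from a scalar run. Accumulating in higher precision reduces that effect.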

Integration of source codes for KNC and KNL

pair_dpd_offload_intel.cpp, pair_dpd_offload_intel.h

pair_dpd_intel.cpp, pair_dpd_intel.h

• Use the macro LMP_INTEL_OFFLOAD to separate KNC and KNL/CPU code

• Use thread parallelism
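Keeping both targets in one source file with a preprocessor macro can be sketched as follows. Only the macro name LMP_INTEL_OFFLOAD comes from the slides; the functions `offload_compiled` and `build_mode` are hypothetical.

```cpp
#include <cassert>
#include <string>

// Compile with -DLMP_INTEL_OFFLOAD to select the coprocessor (KNC) path.
// Without it, the same source builds the KNL/CPU path.
inline bool offload_compiled() {
#ifdef LMP_INTEL_OFFLOAD
  return true;   // offload-specific code is compiled in
#else
  return false;  // offload-specific code is compiled out
#endif
}

// Human-readable label for the path that was compiled.
inline std::string build_mode() {
  return offload_compiled() ? "offload" : "native";
}
```

A single source with guarded sections avoids maintaining two near-identical files, as long as the guarded regions stay small.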

Intel KNL for test

Intel® Xeon Phi™ Processor 7210 (all 16 GB MCDRAM used as cache memory)

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                256
On-line CPU(s) list:   0-255
Thread(s) per core:    4
Core(s) per socket:    64
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 87
Model name:            Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
Stepping:              1
CPU MHz:               1182.289
BogoMIPS:              2599.92
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0-255

Build configurations (CC = mpiicpc):

KNL:
OPTFLAGS = -O2 -fp-model fast=2 -no-prec-div -qoverride-limits
CCFLAGS = -qopenmp -qno-offload -fno-alias -ansi-alias -restrict \
          -DLMP_INTEL_USELRT $(OPTFLAGS)

KNL_AVX512:
OPTFLAGS = -xMIC-AVX512 -O2 -fp-model fast=2 -no-prec-div -qoverride-limits
CCFLAGS = -qopenmp -qno-offload -fno-alias -ansi-alias -restrict \
          -DLMP_INTEL_USELRT $(OPTFLAGS)

KNL_AVX512_MKL:
OPTFLAGS = -xMIC-AVX512 -O2 -fp-model fast=2 -no-prec-div -qoverride-limits
CCFLAGS = -qopenmp -qno-offload -fno-alias -ansi-alias -restrict \
          -DLMP_INTEL_USELRT -DLMP_USE_MKL_RNG $(OPTFLAGS)

export OMP_NUM_THREADS=$threads
mpirun -np 64 lmp_intel_knl -in in.intel.dpd -log dpd.64c4t.log \
       -pk intel 0 -sf intel -screen none -v d 1

Run for 1, 2, 4 threads per core

[Figure: LAMMPS DPD on Intel KNL, 512000 atoms × 4000 steps. Horizontal axis: (64 cores with) N threads/core. Series: KNL, KNL_AVX512, KNL_AVX512_MKL]

Intel® Trace Analyzer for MPI behavior

[Figure: Intel® Trace Analyzer views for the KNL and KNL_AVX512_MKL builds]

Conclusions

• LAMMPS DPD optimization for the Intel platform is highlighted.
• To promote the applications of LAMMPS DPD.

Acknowledgements

• The LAMMPS DPD module inside the USER-INTEL package was initially developed in the CAS-IPCC project.

• Thanks to W. Michael Brown from Intel for his support.

Thank you for your attention!
