Page 1

HPC Applications Performance and Optimizations Best Practices

Pak Lui

Page 2

130 Applications Best Practices Published

• Abaqus
• AcuSolve
• Amber
• AMG
• AMR
• ABySS
• ANSYS CFX
• ANSYS FLUENT
• ANSYS Mechanics
• BQCD
• CCSM
• CESM
• COSMO
• CP2K
• CPMD
• Dacapo
• Desmond
• DL-POLY
• Eclipse
• FLOW-3D
• GADGET-2
• GROMACS
• Himeno
• HOOMD-blue
• HYCOM
• ICON
• Lattice QCD
• LAMMPS
• LS-DYNA
• miniFE
• MILC
• MSC Nastran
• MR Bayes
• MM5
• MPQC
• NAMD
• Nekbone
• NEMO
• NWChem
• Octopus
• OpenAtom
• OpenFOAM
• OpenMX
• PARATEC
• PFA
• PFLOTRAN
• Quantum ESPRESSO
• RADIOSS
• SPECFEM3D
• WRF

For more information, visit: http://www.hpcadvisorycouncil.com/best_practices.php

Page 3

Agenda

• Overview Of HPC Application Performance

• Ways To Inspect/Profile/Optimize HPC Applications

– CPU/Memory, File I/O, Network

• System Configurations and Tuning

• Case Studies, Performance Optimization and Highlights

– LS-DYNA (CPU application)

– HOOMD-blue (GPU application)

• Conclusions

Page 4

HPC Application Performance Overview

• To achieve scalable performance on HPC applications

– Requires understanding the workload through profile analysis

• Tune for where the most time is spent (CPU, network, I/O, etc.)

– Underlying implicit requirement: each node must perform similarly

• Run CPU/memory/network tests or a cluster checker to identify bad node(s) (see the sketch below)

– Compare behavior when using different HW components

• Helps pinpoint bottlenecks in different areas of the HPC cluster

• A selection of HPC applications will be shown

– To demonstrate methods of profiling and analysis

– To determine the bottlenecks in SW/HW

– To determine the effectiveness of tuning for improving performance
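• Example: a quick per-node sanity sweep (a minimal sketch; assumes pdsh is installed and a STREAM binary sits on a shared path — the node range and paths below are placeholders)

– $ pdsh -w node[01-32] "/shared/bench/stream | grep Triad"      # compare Triad bandwidth across nodes; a low outlier points to a misconfigured or faulty node

– $ pdsh -w node[01-32] "ibstat | grep -E 'State|Rate'"          # confirm every IB port is Active and links at the same rate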

Page 5

Ways To Inspect and Profile Applications

• Computation (CPU/Accelerators)

– Tools: top, htop, perf top, pstack, Visual Profiler, etc

– Tests and Benchmarks: HPL, STREAM

• File I/O

– Bandwidth and Block Size: iostat, collectl, darshan, etc

– Characterization Tools and Benchmarks: iozone, ior, etc

• Network Interconnect

– Tools and Profilers: perfquery, MPI profilers (IPM, TAU, etc)

– Characterization Tools and Benchmarks:

• Latency and Bandwidth: OSU benchmarks, IMB (example below)
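• Example: measuring point-to-point latency and bandwidth between two nodes (a minimal sketch; assumes the OSU Micro-Benchmarks and IMB are built against the cluster MPI — hostnames are placeholders)

– $ mpirun -np 2 -host node01,node02 ./osu_latency          # small-message latency between the two hosts

– $ mpirun -np 2 -host node01,node02 ./osu_bw               # streaming bandwidth between the two hosts

– $ mpirun -np 2 -host node01,node02 ./IMB-MPI1 PingPong    # equivalent Intel MPI Benchmarks test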

Page 6

Case Study: LS-DYNA

Page 7

LS-DYNA

• LS-DYNA

– A general-purpose structural and fluid analysis simulation software package capable of simulating complex real-world problems

– Developed by the Livermore Software Technology Corporation (LSTC)

• LS-DYNA is used by

– Automobile

– Aerospace

– Construction

– Military

– Manufacturing

– Bioengineering

Page 8

Objectives

• The presented research was done to provide best practices

– Performance benchmarking

• CPU/Memory performance comparison

• MPI Library performance comparison

• Interconnect performance comparison

• The presented results will demonstrate

– The scalability of the compute environment/application

– Considerations for higher productivity and efficiency

Page 9

Test Cluster Configuration

• Dell™ PowerEdge™ R720xd/R720 32-node (640-core) cluster

– Dual-Socket Deca-Core Intel E5-2680 V2 @ 2.80 GHz CPUs (Turbo Mode enabled)

– Memory: 64GB DDR3 1600 MHz Dual Rank Memory Modules (Static Max Perf in BIOS)

– Hard Drives: 24x 250GB 7.2K RPM SATA 2.5” on RAID 0

– OS: RHEL 6.2, MLNX_OFED 2.1-1.0.6 InfiniBand SW stack

• Intel Cluster Ready certified cluster

• Mellanox Connect-IB FDR InfiniBand and ConnectX-3 Ethernet adapters

• Mellanox SwitchX 6036 VPI InfiniBand and Ethernet switches

• MPI: Intel MPI 4.1, Platform MPI 9.1, Open MPI 1.7.4 w/ FCA 2.5 & MXM 2.1

• Application: LS-DYNA

– mpp971_s_R3.2.1_Intel_linux86-64 (for TopCrunch)

– ls-dyna_mpp_s_r7_0_0_79069_x64_ifort120_sse2

• Benchmark datasets: 3 Vehicle Collision, Neon refined revised

Page 10

PowerEdge R720xd – Massive flexibility for data-intensive operations

• Performance and efficiency

– Intelligent hardware-driven systems management with extensive power management features

– Innovative tools including automation for parts replacement and lifecycle manageability

– Broad choice of networking technologies from GigE to IB

– Built-in redundancy with hot-plug and swappable PSUs, HDDs and fans

• Benefits

– Designed for performance workloads

• From big data analytics, distributed storage or distributed computing, where local storage is key to classic HPC and large-scale hosting environments

• High-performance scale-out compute and low-cost dense storage in one package

• Hardware capabilities

– Flexible compute platform with dense storage capacity

• 2S/2U server, 6 PCIe slots

– Large memory footprint (up to 768GB / 24 DIMMs)

– High I/O performance and optional storage configurations

• HDD options: 12 x 3.5” – or – 24 x 2.5” + 2 x 2.5” HDDs in rear of server

• Up to 26 HDDs with 2 hot-plug drives in rear of server for boot or scratch

Page 11

LS-DYNA Performance – Turbo Mode

• Turbo Boost enables processors to run above their base frequency

– Allows CPU cores to run dynamically above the rated clock speed

– When thermal headroom allows the CPU to operate

– The 2.8GHz base clock can boost up to a max turbo frequency of 3.6GHz

• Running with Turbo Boost provides a ~8% performance boost (verification sketch below)

– About 17% more power is drawn when running with Turbo Boost

[Chart: LS-DYNA performance, FDR InfiniBand, higher is better – ~8% performance gain and ~17% higher power draw with Turbo Boost]
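• Example: verifying that Turbo Boost is active during a run (a minimal sketch; the intel_pstate entry exists only on newer kernels, and option spellings vary by turbostat version)

– $ grep "cpu MHz" /proc/cpuinfo | sort -u                # cores should report above the 2.8GHz base clock under load

– $ cat /sys/devices/system/cpu/intel_pstate/no_turbo     # 0 = turbo enabled, 1 = disabled (intel_pstate driver)

– $ turbostat -i 5                                        # per-core frequency and package power draw over time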

Page 12

LS-DYNA Performance – Memory Module

Single node, higher is better

• Dual-rank memory modules provide better speedup for LS-DYNA

– Dual Rank 1600MHz DIMMs are 8.9% faster than Single Rank 1866MHz DIMMs

– Memory ranking matters more to performance than memory speed

• System components used:

– DDR3-1866MHz PC14900 CL13 Single Rank Memory Module

– DDR3-1600MHz PC12800 CL11 Dual Rank Memory Module

[Chart: 8.9% advantage for the dual-rank 1600MHz configuration]

Page 13

LS-DYNA Performance – File I/O

• The effect of parallel file I/O grows as more cluster nodes participate

– When parallel file I/O takes place, additional traffic appears on the network

• Staging I/O in local /dev/shm is used here as a test, not as a solution for production environments (staging sketch below)

– This test demonstrates the importance of a good parallel file system

– It is advised to use a cluster file system instead of temporary local storage

[Chart: FDR InfiniBand, higher is better – ~22% gain from staging I/O in /dev/shm vs. NFS]
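• Example: staging a run in /dev/shm (a minimal sketch for testing only; assumes pdsh/pdcp are available — paths, node range and the input deck name are placeholders)

– $ pdsh -w node[01-32] mkdir -p /dev/shm/run                              # create the staging directory on every node

– $ pdcp -w node[01-32] /nfs/bench/3cars/* /dev/shm/run/                   # copy the input deck to node-local shared memory

– $ cd /dev/shm/run && mpirun -np 640 <ls-dyna mpp binary> i=input.k       # run from the staged copy

– $ cp /dev/shm/run/d3plot* /nfs/results/ && pdsh -w node[01-32] rm -rf /dev/shm/run   # collect results, then clean up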

Page 14

I/O Profiling with collectl

• Collectl (collect for Linux)

– http://collectl.sourceforge.net/

• Example to run it:

– $ collectl -sdf >> ~/$HOSTNAME.$(date +%y%m%d_%H%M%S).txt

• Example of the collectl report:

waiting for 1 second sample...

#<----------Disks-----------><------NFS Totals------>

#KBRead Reads KBWrit Writes Reads Writes Meta Comm

0 0 0 0 48 0 5 0

0 0 0 0 53 0 4 0

0 0 64 5 29 1 96 1

0 0 0 0 202 3 2521 3

Page 15

LS-DYNA Profiling – File I/O

• Some file I/O takes place at the beginning of the run

– For the node hosting rank 0, as well as for the other cluster nodes

Page 16

LS-DYNA Performance – Interconnects

• FDR InfiniBand delivers superior scalability in application performance

– Provides over 7 times higher performance than 1GbE

– Almost 5 times faster than 10GbE and 40GbE

– 1GbE stops scaling beyond 4 nodes, and 10GbE stops scaling beyond 8 nodes

– Only FDR InfiniBand demonstrates continuous performance gain at scale

[Chart: Intel E5-2680 v2, higher is better – gains of 707% over 1GbE and 474%/492% over 10GbE/40GbE at scale]

Page 17

LS-DYNA Performance – Open MPI Tuning

NFS file system, higher is better

• FCA and MXM enhance LS-DYNA performance at scale for Open MPI

– FCA offloads MPI collective operations to hardware, while MXM provides memory enhancements to parallel communication libraries

– FCA and MXM provide an 18% speedup over the untuned baseline run at 32 nodes

• MCA parameters for enabling FCA and MXM (assembled into a full launch line below):

– For enabling MXM:
-mca mtl mxm -mca pml cm -mca mtl_mxm_np 0 -x MXM_TLS=ud,shm,self -x MXM_SHM_RNDV_THRESH=32768 -x MXM_RDMA_PORTS=mlx5_0:1

– For enabling FCA:
-mca coll_fca_enable 1 -mca coll_fca_np 0 -x fca_ib_dev_name=mlx5_0

[Chart: speedups of 10%, 2% and 18% relative to the untuned baseline]
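• Example: the FCA and MXM parameters above assembled into one launch line (a minimal sketch; the process count, hostfile and LS-DYNA binary/input names are placeholders)

– $ mpirun -np 640 -hostfile hosts -bind-to-core \
      -mca mtl mxm -mca pml cm -mca mtl_mxm_np 0 \
      -x MXM_TLS=ud,shm,self -x MXM_SHM_RNDV_THRESH=32768 -x MXM_RDMA_PORTS=mlx5_0:1 \
      -mca coll_fca_enable 1 -mca coll_fca_np 0 -x fca_ib_dev_name=mlx5_0 \
      <ls-dyna mpp binary> i=input.k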

Page 18

LS-DYNA Performance – MPI

• Platform MPI performs better than Open MPI and Intel MPI

– Up to 1% better than Open MPI and 16% better than Intel MPI

• Tuning parameters used:

– Open MPI: -bind-to-core, FCA, MXM, KNEM

– Platform MPI: -cpu_bind, -xrc

– Intel MPI: -IB I_MPI_FABRICS shm:ofa I_MPI_PIN on I_MPI_ADJUST_BCAST 1 -genv MV2_USE_APM 0 -genv I_MPI_OFA_USE_XRC 1 -genv I_MPI_DAPL_PROVIDER ofa-v2-mlx5_0-1u

FDR InfiniBand, higher is better

[Chart: Platform MPI ahead by 1% vs. Open MPI and 16% vs. Intel MPI]

Page 19

LS-DYNA Performance – System Generations

• The Intel E5-2600 v2 series (Ivy Bridge) outperforms prior generations

– Up to 20% higher performance than the Intel Xeon E5-2680 (Sandy Bridge) cluster

– Up to 70% higher performance than the Intel Xeon X5670 (Westmere) cluster

– Up to 167% higher performance than the Intel Xeon X5570 (Nehalem) cluster

• System components used:

– Ivy Bridge: 2-socket 10-core E5-2680v2 @ 2.8GHz, 1600MHz DIMMs, Connect-IB FDR

– Sandy Bridge: 2-socket 8-core E5-2680 @ 2.7GHz, 1600MHz DIMMs, ConnectX-3 FDR

– Westmere: 2-socket 6-core X5670 @ 2.93GHz, 1333MHz DIMMs, ConnectX-2 QDR

– Nehalem: 2-socket 4-core X5570 @ 2.93GHz, 1333MHz DIMMs, ConnectX-2 QDR

[Chart: FDR InfiniBand, higher is better – 20%, 70% and 167% gains over Sandy Bridge, Westmere and Nehalem respectively]

Page 20

LS-DYNA Performance – TopCrunch

• HPC Advisory Council performs better than the previous best published results

– TopCrunch (www.topcrunch.com) publishes LS-DYNA performance results

– HPCAC achieved better performance on a per-node basis

– 9% to 27% higher performance than the best published results on TopCrunch (Feb 2014)

• Comparing to all platforms on TopCrunch

– HPC Advisory Council results are world-best for systems of 2 to 32 nodes

– Achieving higher performance than systems with larger node counts

Higher is better, FDR InfiniBand

[Chart: per-configuration gains of 9% to 27% over the previous best TopCrunch results]

Page 21

LS-DYNA Performance – System Utilization

FDR InfiniBand, higher is better

• Maximizing system productivity by running 2 jobs in parallel (launch sketch below)

– Up to 77% increase in system throughput at 32 nodes

– Run the jobs separately in parallel by splitting the system resources in half

• System configuration used:

– Single job: use all cores and both IB ports for one job

– 2 jobs in parallel: the cores of 1 CPU and 1 IB port for one job, the rest for the other

[Chart: throughput gains of 22%, 38% and 77% from running two jobs in parallel]
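• Example: launching the two half-system jobs (a minimal sketch; assumes Open MPI with numactl available — the binaries, inputs and exact binding flags are placeholders and depend on the installed versions)

– $ mpirun -np 320 -npernode 10 -hostfile hosts -x MXM_RDMA_PORTS=mlx5_0:1 \
      numactl --cpunodebind=0 --membind=0 <app> i=job1.k &      # job 1: socket 0 cores, first IB port

– $ mpirun -np 320 -npernode 10 -hostfile hosts -x MXM_RDMA_PORTS=mlx5_0:2 \
      numactl --cpunodebind=1 --membind=1 <app> i=job2.k &      # job 2: socket 1 cores, second IB port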

Page 22

MPI Profiling with IPM

• IPM (Integrated Performance Monitoring)

– http://ipm-hpc.sourceforge.net/ (existing)

– http://ipm2.org/ (new)

• Example to run it:

– LD_PRELOAD=/usr/mpi/gcc/openmpi-1.7.4/tests/ipm-2.0.2/lib/libipm.so mpirun -x LD_PRELOAD <rest of the cmd>

• To generate HTML for the report (full workflow sketched below)

– export IPM_KEYFILE=/usr/mpi/gcc/openmpi-1.7.4/tests/ipm-2.0.2/etc/ipm_key_mpi

– export IPM_REPORT=full

– export IPM_LOG=full

– ipm_parse -html <username>.<Timestamp>.ipm.xml
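• Example: the full IPM workflow end to end (a minimal sketch; paths follow the slide above, while the application name and rank count are placeholders)

– $ export IPM_REPORT=full IPM_LOG=full

– $ export LD_PRELOAD=/usr/mpi/gcc/openmpi-1.7.4/tests/ipm-2.0.2/lib/libipm.so

– $ mpirun -np 640 -x LD_PRELOAD -x IPM_REPORT -x IPM_LOG <app>      # IPM writes an XML profile at job end

– $ unset LD_PRELOAD

– $ export IPM_KEYFILE=/usr/mpi/gcc/openmpi-1.7.4/tests/ipm-2.0.2/etc/ipm_key_mpi

– $ ipm_parse -html <username>.<timestamp>.ipm.xml                   # turns the XML into an HTML report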

Page 23

LS-DYNA Profiling – MPI/User Time Ratio

• Computation time is dominant compared to MPI communication time

– MPI communication ratio increases as the cluster scales

• Both computation and communication time decline as the cluster scales

– The InfiniBand infrastructure allows spreading the work without adding overhead

– Computation time drops faster compared to communication time

– Compute bound: Tuning for computation performance could yield better results

FDR InfiniBand

Page 24

LS-DYNA Profiling – MPI Calls

• MPI_Wait, MPI_Send and MPI_Recv are the most used MPI calls

– MPI_Wait (31%), MPI_Send (21%), MPI_Irecv (18%), MPI_Recv (14%), MPI_Isend (12%)

• The majority of LS-DYNA's MPI calls are point-to-point data transfers

– Both blocking and non-blocking point-to-point transfers are seen

– LS-DYNA makes extensive use of the MPI API; over 23 MPI routines are used

Page 25

LS-DYNA Profiling – Time Spent by MPI Calls

• The majority of MPI time is spent in MPI_Recv and MPI collective operations

– MPI_Recv (36%), MPI_Allreduce (27%), MPI_Bcast (24%)

• MPI communication time decreases gradually as the cluster scales

– Due to the faster total runtime, as more CPUs work on completing the job sooner

– This reduces the communication time spent in each of the MPI calls

Page 26

LS-DYNA Profiling – MPI Message Sizes

• Most of the MPI messages are in the small to medium size range

– Most message sizes are between 0B and 64B

• For the most time-consuming MPI calls:

– MPI_Recv: most messages are under 4KB

– MPI_Bcast: the majority are less than 16B, but larger messages exist

– MPI_Allreduce: most messages are less than 256B

3 Vehicle Collision – 32 nodes

Page 27

LS-DYNA Profiling – MPI Data Transfer

• As the cluster grows, substantially less data is transferred between MPI processes

– Drops from ~10GB per rank at 2 nodes to ~4GB per rank at 32 nodes

– Rank 0 shows higher transfer volumes than the rest of the MPI ranks

– Rank 0 is responsible for file I/O and uses MPI to communicate with the rest of the ranks

Page 28

LS-DYNA Profiling – Aggregated Transfer

• Aggregated data transfer refers to:

– Total amount of data being transferred in the network between all MPI ranks collectively

• Large data transfers take place in LS-DYNA

– Around 2TB of data is exchanged between the nodes at 32 nodes

FDR InfiniBand

Page 29

LS-DYNA Profiling – Memory Usage By Node

• A uniform amount of memory is consumed across nodes when running LS-DYNA

– About 38GB of memory is used per node

– Each node runs with 20 ranks, thus about 2GB per rank is needed

– The same trend continues for 2 to 32 nodes

3 Vehicle Collision – 2 nodes 3 Vehicle Collision – 32 nodes

Page 30

LS-DYNA – Summary

• Performance

– Compute: Xeon E5-2680v2 (Ivy Bridge) and FDR InfiniBand enable LS-DYNA to scale

• Up to 20% over Sandy Bridge, Up to 70% over Westmere, 167% over Nehalem

– Compute: Running with Turbo Boost provides ~8% of performance boost

• More power (~17%) would be drawn for running in Turbo Boost

– Memory: Dual-ranked memory module provides better speedup than single ranked

• Using Dual Rank 1600MHz DIMM is ~9% faster than single rank 1866MHz DIMM

– I/O: File I/O becomes important at scale; running in /dev/shm performs up to ~22% better than NFS

– Network: FDR InfiniBand delivers superior scalability in application performance

• Provides over 7 times higher performance than 1GbE and almost 5 times higher than 10/40GbE at 32 nodes

• Tuning

– FCA and MXM enhance LS-DYNA performance at scale for Open MPI

• Provide a speedup of 7% over the untuned baseline run at 32 nodes

– As the CPU/MPI time ratio shows, significantly more time is spent in computation than in communication

• Profiling

– Majority of MPI calls are for (blocking and non-blocking) point-to-point communications

Page 31

Case Study: HOOMD-blue

Page 32

HOOMD-blue

• Highly Optimized Object-oriented Many-particle Dynamics - Blue Edition

• Performs general purpose particle dynamics simulations

• Takes advantage of NVIDIA GPUs

• Free, open source

• Simulations are configured and run using simple python scripts

• The development effort is led by the Glotzer group at the University of Michigan

– Many groups from different universities have contributed code to HOOMD-blue

Page 33

Note

• The following research was performed under the HPC Advisory Council activities

– Participating vendors: Dell, Mellanox, NVIDIA

– Compute resource: the Wilkes cluster at the University of Cambridge / HPC Advisory Council Cluster Center

• The following was done to provide best practices

– HOOMD-blue performance overview

– Understanding HOOMD-blue communication patterns

– MPI libraries comparisons

• For more info please refer to

– http://www.dell.com

– http://www.mellanox.com

– http://www.nvidia.com

– http://codeblue.umich.edu/hoomd-blue

– J. A. Anderson, C. D. Lorenz, and A. Travesset. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics 227(10): 5342–5359, May 2008. doi:10.1016/j.jcp.2008.01.047

Page 34

Objectives

• The following was done to provide best practices

– HOOMD-blue performance benchmarking

– Interconnect performance comparisons

– MPI performance comparison

– Understanding HOOMD-blue communication patterns

• The presented results will demonstrate

– The scalability of the compute environment to provide nearly linear application scalability

– The capability of HOOMD-blue to achieve scalable productivity

Page 35

Test Cluster Configuration 1

• Dell™ PowerEdge™ R720xd/R720 cluster

– Dual-Socket Deca-Core Intel E5-2680 V2 @ 2.80 GHz CPUs (Static Max Perf in BIOS)

– Memory: 64GB DDR3 1600 MHz Dual Rank Memory Modules

– Hard Drives: 24x 250GB 7.2K RPM SATA 2.5” on RAID 0

– OS: RHEL 6.2, MLNX_OFED 2.1-1.0.0 InfiniBand SW stack

• Mellanox Connect-IB FDR InfiniBand

• Mellanox SwitchX SX6036 InfiniBand VPI switch

• NVIDIA® Tesla K40 GPUs (1 GPU per node), ECC enabled via nvidia-smi

• NVIDIA® CUDA® 5.5 Development Tools and Display Driver 331.20

• GPUDirect RDMA (nvidia_peer_memory-1.0-0.tar.gz)

• MPI: Open MPI 1.7.4rc1

• Application: HOOMD-blue (git master 28Jan14)

• Benchmark datasets: Lennard-Jones Liquid Benchmarks (16K, 64K Particles)

Page 36

GPUDirect™ RDMA

• Eliminates CPU bandwidth and latency bottlenecks

• Uses remote direct memory access (RDMA) transfers between GPUs

• Results in significantly improved MPI_Send/MPI_Recv efficiency between GPUs in remote nodes

[Diagram: data path with GPUDirect™ RDMA, using PeerDirect™]

Page 38

GPUDirect RDMA – MPI stack

• MVAPICH2-GDR 2.0b

– Requires CUDA 5.5

– Required parameters:

• MV2_USE_CUDA

• MV2_USE_GPUDIRECT

– Optional tuning parameters:

• MV2_GPUDIRECT_LIMIT

• MV2_USE_GPUDIRECT_RECEIVE_LIMIT

• MV2_RAIL_SHARING_POLICY=FIXED_MAPPING

• MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1

• MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=1G

• Open MPI 1.7.5

– Requires CUDA 6.0

– Required parameter:

• btl_openib_want_cuda_gdr

– Optional tuning parameters:

• btl_openib_cuda_rdma_limit (default 30,000B)

• btl_openib_cuda_eager_limit

• btl_smcuda_use_cuda_ipc

Page 39

HOOMD-blue Performance – GPUDirect RDMA

• GPUDirect RDMA enables higher performance on a small GPU cluster

– Demonstrated up to 20% higher performance at 4 nodes for 16K particles

– Showed up to 10% performance gain at 4 nodes for 64K particles

• Adjusting the Open MPI MCA parameters can maximize GPUDirect RDMA usage

– Based on MPI profiling, the GDR limit for 64K particles was tuned to 65KB

• MCA Parameter to enable and tune GPUDirect RDMA for Open MPI:

– -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_cuda_rdma_limit XXXX

[Chart: Open MPI, higher is better – 20% (16K particles) and 10% (64K particles) gains at 4 nodes]

Page 40

HOOMD-blue Profiling – MPI Message Sizes

• HOOMD-blue utilizes non-blocking and collectives for most data transfers

– 16K particles: MPI_Isend/MPI_Irecv are concentrated between 4B to 24576B

– 64K particles: MPI_Isend/MPI_Irecv are concentrated between 4B to 65536B

• MCA parameter used to enable and tune for GPUDirect RDMA

– 16K particles: the default limit already allows all sends/receives to use GPUDirect RDMA

– 64K particles: maximize GDR usage by tuning the MCA parameter to cover messages up to 65KB

• -mca btl_openib_cuda_rdma_limit 65537 (Change for 64K particles case)

1 MPI Process/Node

4 Nodes – 16K Particles 4 Nodes – 64K Particles

Page 41

Test Cluster Configuration 2

• Dell™ PowerEdge™ T620 128-node (1536-core) Wilkes cluster at Univ of Cambridge

– Dual-Socket Hexa-Core Intel E5-2630 v2 @ 2.60 GHz CPUs

– Memory: 64GB memory, DDR3 1600 MHz

– OS: Scientific Linux release 6.4 (Carbon), MLNX_OFED 2.1-1.0.0 InfiniBand SW stack

– Hard Drives: 2x 500GB 7.2K RPM 64MB Cache SATA 3.0Gb/s 3.5”

• Mellanox Connect-IB FDR InfiniBand adapters

• Mellanox SwitchX SX6036 InfiniBand VPI switch

• NVIDIA® Tesla K20 GPUs (2 GPUs per node), ECC enabled in nvidia-smi

• NVIDIA® CUDA® 5.5 Development Tools and Display Driver 331.20

• GPUDirect RDMA (nvidia_peer_memory-1.0-0.tar.gz)

• MPI: Open MPI 1.7.4rc1, MVAPICH2-GDR 2.0b

• Application: HOOMD-blue (git master 28Jan14)

• Benchmark datasets: Lennard-Jones Liquid Benchmarks (256K and 512K Particles)

Page 42

The Wilkes Cluster at University of Cambridge

• The University of Cambridge in partnership with Dell, NVIDIA and Mellanox

– The UK’s fastest academic cluster, deployed November 2013

• Produces a LINPACK performance of 240TF

– Position 166 on the November 2013 Top500 list

• Ranked most energy efficient air cooled supercomputer in the world

• Ranked second in the worldwide Green500 ranking

– Extremely high performance per watt of 3631 MFLOP/W

• Architected to utilize the NVIDIA RDMA communication acceleration

– Significantly increasing the system's parallel efficiency

Page 43

GPUDirect™ RDMA at Wilkes Cluster

[Diagram: GPU Direct peer-to-peer – two GPUs on the same PCI lane; intra-node communication]

[Diagram: GPU Direct over RDMA – NIC and GPU on the same PCI lane; inter-node communication]

Page 44

GPU Direct RDMA

• Open MPI 1.7.5:

– mpirun -np $NP -map-by ppr:1:socket --bind-to socket

-mca rmaps_base_dist_hca mlx5_0:1

-mca coll_fca_enable 0 -mca mtl ^mxm

-mca btl_openib_want_cuda_gdr 1 -mca btl_smcuda_use_cuda_ipc 0

-mca btl_smcuda_use_cuda_ipc_same_gpu 0 <app>

• MVAPICH2-GDR 2.0b :

– mpirun -np $NP -ppn 2 -genvall

-genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING

-genv MV2_PROCESS_TO_RAIL_MAPPING mlx5_0

-genv MV2_ENABLE_AFFINITY 1

-genv MV2_CPU_BINDING_LEVEL SOCKET

-genv MV2_CPU_BINDING_POLICY SCATTER

-genv MV2_USE_CUDA 1 -genv MV2_CUDA_IPC 0

-genv MV2_USE_GPUDIRECT 1 <app>

Page 45

HOOMD-blue Performance – GPUDirect RDMA

• GPUDirect RDMA unlocks performance between GPU and IB

– Demonstrated up to 102% higher performance at 64 nodes

• GPUDirect RDMA provides a direct P2P data path between GPU and IB

– This new technology significantly lowers GPU-GPU communication latency

– Completely offloads the CPU from GPU communications across the network

• MCA parameter to enable GPUDirect RDMA between 1 GPU and 1 IB adapter per node:

• --mca btl_openib_want_cuda_gdr 1 (default value used for btl_openib_cuda_rdma_limit)

[Chart: Open MPI, higher is better – up to 102% gain with GPUDirect RDMA]

Page 46

HOOMD-blue Performance – GPUDirect RDMA

[Chart: HOOMD-blue performance with GPUDirect RDMA – Open MPI, higher is better]

Page 47

HOOMD-blue Performance – Scalability

• FDR InfiniBand empowers Wilkes to surpass Titan on scalability

– Titan showed higher per-node performance but Wilkes outperformed in scalability

– Titan: K20X GPUs, which run at a higher clock rate than the K20 GPU

– Wilkes: K20 GPUs at PCIe Gen2, and FDR InfiniBand at Gen3 rate

• Wilkes exceeds Titan in scalability performance with FDR InfiniBand

– Outperformed Titan by up to 114% at 32 nodes

[Chart: 1 process/node – Wilkes outperforms Titan by up to 114% and 91% at scale]

Page 48

HOOMD-blue Performance – MPI

• Open MPI performs better than MVAPICH2-GDR

– At lower scale, MVAPICH2-GDR performs slightly better

– At higher scale, Open MPI shows better scalability

– Locality of IB interface used was explicitly specified with flags when tests were run

• Both MPI implementations are in their beta releases

– Scalability is expected to improve in their release versions

– An issue prevented MVAPICH2-GDR from running at 8 and 64 nodes

Higher is better 1 Process/Node

Page 49

HOOMD-blue Performance – Host Buffer Staging

• HOOMD-blue can run with non-CUDA-aware MPI using host buffer staging (HBS)

– HOOMD-blue is built using the "ENABLE_MPI=ON" and "ENABLE_MPI_CUDA=OFF" flags (build sketch below)

– Non-CUDA-aware (or host) MPI has lower latency than CUDA-aware MPI

– With GDR: buffers are copied individually by the CUDA-aware MPI, giving slightly higher MPI latency

– With HBS: only single large buffers are copied as needed, giving lower MPI latency

• GDR performs on par with HBS at large scale, and better in some cases

– At large scale, HBS appears to perform slightly faster than GDR

– At small scale, GDR can be faster than HBS with a small number of particles per GPU

Higher is better 1 Process/Node
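• Example: configuring the HBS build (a minimal sketch; HOOMD-blue of this era builds with CMake — the source path and the ENABLE_CUDA flag name are assumptions, while ENABLE_MPI/ENABLE_MPI_CUDA come from the slide above)

– $ cd hoomd-blue && mkdir build && cd build

– $ cmake .. -DENABLE_CUDA=ON -DENABLE_MPI=ON -DENABLE_MPI_CUDA=OFF    # MPI on, CUDA-aware MPI off → host buffer staging path

– $ make -j 16 && make install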

Page 50

HOOMD-blue Profiling – % Time Spent on MPI

• HOOMD-blue utilizes both non-blocking and collective ops for comm

– Changes in network communications take place as cluster scales

– 4 nodes: MPI_Waitall(75%), the rest are MPI_Bcast and MPI_Allreduce

– 96 nodes: MPI_Bcast (35%), the rest are MPI_Allreduce, MPI_Waitall

Open MPI

96 Nodes – 512K Particles 4 Nodes – 512K Particles

Page 51

HOOMD-blue Profiling – MPI Communication

• Each rank engages in similar network communication

– Except for rank 0, which spends less time in MPI_Bcast

1 MPI Process/Node

96 Nodes – 512K Particles 4 Nodes – 512K Particles

Page 52

HOOMD-blue Profiling – MPI Message Sizes

• HOOMD-blue utilizes non-blocking and collectives for most data transfers

– 4 Nodes: MPI_Isend/MPI_Irecv are concentrated between 28KB to 229KB

– 96 Nodes: MPI_Isend/MPI_Irecv are concentrated between 64B to 16KB

• GPUDirect RDMA is enabled for messages between 0B to 30KB

– MPI_Isend/_Irecv messages are able to take advantage of GPUDirect RDMA

– Messages that fit within the (tunable) default 30KB window can benefit

1 MPI Process/Node

96 Nodes – 512K Particles 4 Nodes – 512K Particles

Page 53

HOOMD-blue Profiling – Point to Point Transfer

• Distribution of data transfers between the MPI processes

– Non-blocking point-to-point data communications between processes are involved

– Less data is transferred between processes as more ranks take part in the run

1 MPI Process/Node

96 Nodes – 512K Particles 4 Nodes – 512K Particles

Page 54

HOOMD-blue – Summary

• HOOMD-blue demonstrates good use of GPU and InfiniBand at scale

– FDR InfiniBand is the interconnect that allows HOOMD-blue to scale

– Ethernet solutions would not scale beyond 1 node

• GPUDirect RDMA

– This new technology provides a direct P2P data path between GPU and IB

– This provides a significant decrease in GPU-GPU communication latency

• GPUDirect RDMA unlocks performance between GPU and IB

– Demonstrated up to 20% higher performance at 4 nodes for the 16K-particle case

– Demonstrated up to 102% higher performance at 96 nodes for the 512K-particle case

• InfiniBand empowers Wilkes to surpass Titan on scalability performance

– Titan has higher per-node performance but Wilkes outperforms in scalability

– Outperforms Titan by 114% at 32 nodes

• GPUDirect RDMA performs on par with Host Buffer Staging

– At large scale, HBS appears to perform slightly faster than GDR

– At small scale, GDR can be faster than HBS with a small number of particles per GPU

Page 55

Thank You HPC Advisory Council

All trademarks are property of their respective owners. All information is provided “as-is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein, and undertakes no duty and assumes no obligation to update or correct any information presented herein.