Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at HPCAC-Switzerland (Mar ‘16)
• Scalability for million to billion processors
• Collective communication
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• InfiniBand Network Analysis and Monitoring (INAM)
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Integrated Support for GPGPUs
   – CUDA-Aware MPI
– GPUDirect RDMA (GDR) Support
– CUDA-aware Non-blocking Collectives
– Support for Managed Memory
– Efficient datatype Processing
– Supporting Streaming applications with GDR
– Efficient Deep Learning with MVAPICH2-GDR
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
[Diagram: sender and receiver nodes, each with CPU, GPU, and NIC connected over PCIe, linked through a switch]
At Sender:
cudaMemcpy(s_hostbuf, s_devbuf, . . .);
MPI_Send(s_hostbuf, size, . . .);
At Receiver:
MPI_Recv(r_hostbuf, size, . . .);
cudaMemcpy(r_devbuf, r_hostbuf, . . .);
• Data movement in applications with standard MPI and CUDA interfaces
High Productivity and Low Performance
MPI + CUDA - Naive
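For reference, a compilable sketch of this naive staging pattern follows; the pinned host staging buffers, the MPI_BYTE type, and tag 0 are illustrative assumptions rather than details from the slide.

#include <stddef.h>
#include <mpi.h>
#include <cuda_runtime.h>

void naive_send(const void *s_devbuf, size_t size, int dst, MPI_Comm comm)
{
    void *s_hostbuf;
    cudaMallocHost(&s_hostbuf, size);                        /* pinned host staging buffer */
    cudaMemcpy(s_hostbuf, s_devbuf, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_hostbuf, (int)size, MPI_BYTE, dst, 0, comm);  /* MPI only ever sees host memory */
    cudaFreeHost(s_hostbuf);
}

void naive_recv(void *r_devbuf, size_t size, int src, MPI_Comm comm)
{
    void *r_hostbuf;
    cudaMallocHost(&r_hostbuf, size);
    MPI_Recv(r_hostbuf, (int)size, MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(r_devbuf, r_hostbuf, size, cudaMemcpyHostToDevice);
    cudaFreeHost(r_hostbuf);
}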
At Sender:
for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, …);
for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);
    }
    MPI_Isend(s_hostbuf + j * blksz, blksz, . . .);
}
MPI_Waitall();
<<Similar at receiver>>
• Pipelining at user level with non-blocking MPI and CUDA interfaces
Low Productivity and High Performance
MPI + CUDA - Advanced
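For comparison, a fuller sketch of such a hand-written pipeline, assuming one CUDA stream and one MPI request per chunk; the per-chunk streams, tags, and MPI_BYTE type are illustrative choices, not taken from the slide.

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

void pipelined_send(const char *s_devbuf, char *s_hostbuf, size_t blksz,
                    int pipeline_len, int dst, MPI_Comm comm)
{
    cudaStream_t *streams = malloc(pipeline_len * sizeof(cudaStream_t));
    MPI_Request  *reqs    = malloc(pipeline_len * sizeof(MPI_Request));

    /* Stage each chunk from device to host asynchronously, one stream per chunk */
    for (int j = 0; j < pipeline_len; j++) {
        cudaStreamCreate(&streams[j]);
        cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz,
                        blksz, cudaMemcpyDeviceToHost, streams[j]);
    }

    /* Send each chunk once its copy completes, progressing earlier sends meanwhile */
    for (int j = 0; j < pipeline_len; j++) {
        while (cudaStreamQuery(streams[j]) != cudaSuccess) {
            int flag;
            if (j > 0) MPI_Test(&reqs[j - 1], &flag, MPI_STATUS_IGNORE);
        }
        MPI_Isend(s_hostbuf + j * blksz, (int)blksz, MPI_BYTE, dst, j, comm, &reqs[j]);
    }
    MPI_Waitall(pipeline_len, reqs, MPI_STATUSES_IGNORE);

    for (int j = 0; j < pipeline_len; j++) cudaStreamDestroy(streams[j]);
    free(streams);
    free(reqs);
}

The extra bookkeeping (streams, requests, progress calls) is exactly the productivity cost the next slide removes by moving the pipeline inside the MPI library.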
At Sender:
   MPI_Send(s_devbuf, size, …);
At Receiver:
   MPI_Recv(r_devbuf, size, …);
(data movement handled inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
GPU-Aware MPI Library: MVAPICH2-GPU
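A minimal two-rank sketch of what this looks like from the application with a CUDA-aware MPI such as MVAPICH2-GPU; the message size, ranks, and tag are placeholders.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;
    float *devbuf;
    cudaMalloc((void **)&devbuf, count * sizeof(float));

    if (rank == 0)
        MPI_Send(devbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device pointer passed directly */
    else if (rank == 1)
        MPI_Recv(devbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}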
• OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
• OSU has a design of MVAPICH2 using GPUDirect RDMA
   – Hybrid design using GPU-Direct RDMA
      • GPUDirect RDMA and host-based pipelining
      • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
   – Support for communication using multi-rail
   – Support for Mellanox Connect-IB and ConnectX VPI adapters
   – Support for RoCE with Mellanox ConnectX VPI adapters
GPU-Direct RDMA (GDR) with CUDA
[Diagram: IB adapter, GPU, CPU, chipset, system memory, and GPU memory within a node. Peer-to-peer limits: SNB E5-2670 — P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IVB E5-2680V2 — P2P write 6.4 GB/s, P2P read 3.5 GB/s]
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)
Setup: MVAPICH2-GDR 2.2b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPU-Direct RDMA
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size, comparing MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR. Small-message latency drops to 2.18 µs (about 10x better than MV2 without GDR and 2x better than MV2-GDR 2.0b); bandwidth improves by about 11x and 2x over the same baselines]
Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
[Charts: average time steps per second (TPS) vs. number of processes (4–32) for 64K-particle and 256K-particle runs, comparing MV2 and MV2+GDR; roughly 2X higher TPS with GDR in both cases]
CUDA-Aware Non-Blocking Collectives
[Charts: medium/large-message overlap (%) vs. message size (4K–1M bytes) on 64 GPU nodes for Ialltoall and Igather, each with 1 process/node and with 2 processes/node (1 process/GPU)]
Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
Available since MVAPICH2-GDR 2.2a
A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 2015
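An illustrative sketch of issuing such a CUDA-aware non-blocking collective directly on device buffers and overlapping independent work with it; the function name, counts, and communicator are placeholders, not MVAPICH2-GDR internals.

#include <mpi.h>

void overlapped_alltoall(float *d_sendbuf, float *d_recvbuf,
                         int count_per_rank, MPI_Comm comm)
{
    MPI_Request req;

    /* Device pointers passed directly; the CUDA-aware runtime handles staging/GDR */
    MPI_Ialltoall(d_sendbuf, count_per_rank, MPI_FLOAT,
                  d_recvbuf, count_per_rank, MPI_FLOAT, comm, &req);

    /* ... independent CPU (or GPU) work overlaps with the collective here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
}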
Communication Runtime with GPU Managed Memory
● In CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, which allows a common memory allocation for GPU or CPU through the cudaMallocManaged() call
● Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
● Extended MVAPICH2 to perform communication directly from managed buffers (available in MVAPICH2-GDR 2.2b)
● OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers; available in OMB 5.2
D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop (held in conjunction with PPoPP 2016), Barcelona, Spain
[Charts: OSU micro-benchmark latency (µs) and bandwidth (MB/s) vs. message size (1 byte–16K); latency compares host-to-host (H-H) against managed-to-managed (MH-MH) buffers, and bandwidth compares device-to-device (D-D) against managed-to-managed (MD-MD) buffers]
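A minimal sketch of communicating directly from a managed allocation, as the extended runtime supports; the message size, ranks, and fill loop are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;
    float *buf;
    /* One allocation usable from both CPU code and GPU kernels */
    cudaMallocManaged((void **)&buf, count * sizeof(float), cudaMemAttachGlobal);

    if (rank == 0) {
        for (int i = 0; i < count; i++) buf[i] = (float)i;         /* or fill via a GPU kernel */
        MPI_Send(buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);     /* managed pointer passed directly */
    } else if (rank == 1) {
        MPI_Recv(buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}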
MPI Datatype Processing (Communication Optimization)
Common scenario (a waste of computing resources on CPU and GPU):
   MPI_Isend(A, ..., Datatype, ...);
   MPI_Isend(B, ..., Datatype, ...);
   MPI_Isend(C, ..., Datatype, ...);
   MPI_Isend(D, ..., Datatype, ...);
   ...
   MPI_Waitall(...);
*Buf1, Buf2, … contain non-contiguous MPI Datatypes
[Timeline diagram, existing vs. proposed design: in the existing design, each Isend initiates its packing kernel on a stream and the CPU then waits for that kernel (WFK) before starting the send, serializing Isend(1)–Isend(3); in the proposed design, the kernels are initiated up front and each send starts as soon as its kernel completes, overlapping progress so the proposed design finishes earlier. Expected benefit: less idle time on both CPU and GPU]
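As an illustration of this scenario, a sketch that posts several non-blocking sends of a strided (vector) datatype from a device matrix; the column layout, the count of four sends, and the buffer names are assumptions for the example.

#include <mpi.h>

/* Send the first four columns of a row-major rows x cols device matrix as
 * separate non-blocking sends, mirroring the Isend/Waitall scenario above. */
void send_columns(float *d_matrix, int rows, int cols, int dst, MPI_Comm comm)
{
    MPI_Datatype column;
    MPI_Type_vector(rows, 1, cols, MPI_FLOAT, &column);   /* one strided column */
    MPI_Type_commit(&column);

    MPI_Request reqs[4];
    for (int c = 0; c < 4; c++)
        /* A CUDA-aware MPI can pack each strided column with a GPU kernel */
        MPI_Isend(d_matrix + c, 1, column, dst, c, comm, &reqs[c]);

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    MPI_Type_free(&column);
}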
Application-Level Evaluation (HaloExchange - Cosmo)
[Charts: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16–96 GPUs) and the Wilkes GPU cluster (4–32 GPUs), comparing Default, Callback-based, and Event-based designs]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS ’16
Nature of Streaming Applications
• Pipelined data-parallel compute phases that form the crux of streaming applications lend themselves to GPGPUs
• Data distribution to GPGPU sites occurs over PCIe within the node and over InfiniBand interconnects across nodes
Courtesy: Agarwalla, Bikash, et al., "Streamline: A scheduling heuristic for streaming applications on the grid," Electronic Imaging 2006
• The broadcast operation is a key dictator of the throughput of streaming applications
• The current broadcast operation on GPU clusters does not take advantage of
   • IB hardware MCAST
   • GPUDirect RDMA
SGL-based design for Efficient Broadcast Operation on GPU Systems
• Current design is limited by the expensive copies from/to GPUs
• Proposed several alternative designs to avoid the overhead of the copy
   • Loopback, GDRCOPY and hybrid
   • High performance and scalability
   • Still uses PCIe resources for Host-GPU copies
• Proposed SGL-based design
   • Combines IB MCAST and GPUDirect RDMA features
   • High performance and scalability for D-D broadcast
   • Direct code path between HCA and GPU
   • Frees PCIe resources
• 3X improvement in latency
A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters, IEEE Int’l Conf. on High Performance Computing (HiPC ’14)
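From the application side, the operation being accelerated is simply a repeated broadcast of a device buffer; a hedged sketch follows, with the frame size, frame count, and root rank as placeholders.

#include <mpi.h>
#include <cuda_runtime.h>

void stream_frames(int num_frames, size_t frame_bytes, MPI_Comm comm)
{
    void *d_frame;
    cudaMalloc(&d_frame, frame_bytes);

    for (int i = 0; i < num_frames; i++) {
        /* Rank 0 produces each frame on its GPU; all ranks receive it in device memory */
        MPI_Bcast(d_frame, (int)frame_bytes, MPI_BYTE, 0, comm);
        /* ... each rank then processes the frame on its own GPU ... */
    }
    cudaFree(d_frame);
}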
Accelerating Deep Learning with MVAPICH2-GDR
• Caffe: A flexible and layered Deep Learning framework
• Benefits and Weaknesses
   – Multi-GPU training within a single node
   – Performance degradation for GPUs across different sockets
• Can we enhance Caffe with MVAPICH2-GDR?
   – Caffe-Enhanced: A CUDA-Aware MPI version
   – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
   – Initial evaluation suggests up to 8X reduction in training time on the CIFAR-10 dataset
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
MPI Applications on MIC Clusters
[Diagram: spectrum of execution modes from multi-core centric (Xeon) to many-core centric (Xeon Phi): Host-only (MPI program on the Xeon only), Offload / reverse offload (MPI program on one side with computation offloaded to the other), Symmetric (MPI programs on both Xeon and Xeon Phi), and Coprocessor-only (MPI program on the Xeon Phi only)]
• Flexibility in launching MPI jobs on clusters with Xeon Phi
MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC
• Offload Mode
• Intranode Communication
   • Coprocessor-only and Symmetric Mode
• Internode Communication
   • Coprocessor-only and Symmetric Mode
• Multi-MIC Node Configurations
• Running on three major systems
   • Stampede, Blueridge (Virginia Tech) and Beacon (UTK)
MIC-Remote-MIC P2P Communication with Proxy-based Communication
[Charts: intra-socket and inter-socket MIC-remote-MIC P2P results — large-message latency (µs, 8K–2M bytes) and bandwidth (MB/s, 1 byte–1M bytes), with peak bandwidths of 5236 and 5594 MB/s]
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014
[Charts: MV2-MIC vs. MV2-MIC-Opt latency for 32-node Allgather with 16 hosts + 16 MICs (small messages), 32-node Allgather with 8H + 8M (large messages), and 32-node Alltoall with 8H + 8M (large messages), with annotated improvements of 76%, 58%, and 55%; P3DFFT performance (communication and computation time) on 32 nodes (8H + 8M), problem size 2K×2K×1K]
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
Can HPC and Virtualization be Combined?
• Virtualization has many benefits
   – Fault-tolerance
   – Job migration
   – Compaction
• Virtualization has not been very popular in HPC due to the overhead associated with it
• New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available
• How about Containers support?
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar ’14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC ’14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid ’15
Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM
• Redesign MVAPICH2 to make it virtual machine aware
   – SR-IOV shows near-to-native performance for inter-node point-to-point communication
   – IVSHMEM offers zero-copy access to data on shared memory of co-resident VMs
   – Locality Detector: maintains the locality information of co-resident virtual machines
   – Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively
[Diagram: host environment with two guests; each guest runs an MPI process whose VF driver attaches to an SR-IOV virtual function of the InfiniBand adapter (SR-IOV channel), while an IV-Shmem channel through /dev/shm connects co-resident VMs; the hypervisor holds the PF driver for the physical function]
J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, Euro-Par 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC 2014
MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack
• OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
• Deployment with OpenStack
   – Supporting SR-IOV configuration
   – Supporting IVSHMEM configuration
   – Virtual machine aware design of MVAPICH2 with SR-IOV
• An efficient approach to build HPC Clouds with MVAPICH2-Virt and OpenStack
J. Zhang, X. Lu, M. Arnold, D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid 2015
Application-Level Performance on Chameleon
[Charts: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, and Graph500 execution time (ms) for problem sizes (22,20)–(26,16), comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]
• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 Procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs
NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument
• Large-scale instrument
– Targeting Big Data, Big Compute, Big Instrument research
– ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network
• Reconfigurable instrument
– Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use
• Connected instrument
– Workload and Trace Archive
– Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
– Partnerships with users
• Complementary instrument
– Complementing GENI, Grid’5000, and other testbeds
• Sustainable instrument
– Industry connections
http://www.chameleoncloud.org/
Containers Support: MVAPICH2 Intra-node Point-to-Point Performance on Chameleon
[Charts: intra-node inter-container latency (µs) and bandwidth (MBps) vs. message size (1 byte–64K bytes), comparing Container-Def, Container-Opt, and Native]
• Intra-node inter-container communication
• Compared to Container-Def, up to 81% and 191% improvement on latency and BW
• Compared to Native, minor overhead on latency and BW
Containers Support: Application-Level Performance on Chameleon
[Charts: Graph 500 execution time (ms) for problem sizes (22,16)–(26,20) and NAS (MG.D, FT.D, EP.D, LU.D, CG.D) execution time (s), comparing Container-Def, Container-Opt, and Native]
• 64 Containers across 16 nodes, pinning 4 cores per Container
• Compared to Container-Def, up to 11% and 16% execution time reduction for NAS and Graph 500
• Compared to Native, less than 9% and 4% overhead for NAS and Graph 500
• Optimized Container support will be available with the next release of MVAPICH2-Virt
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
Designing Energy-Aware (EA) MPI Runtime
[Diagram: overall application energy expenditure divides into energy spent in computation routines and energy spent in communication routines (point-to-point, collective, and RMA routines). The MVAPICH2-EA designs cover MPI two-sided and collectives (e.g., MVAPICH2) and impact MPI-3 RMA implementations (e.g., MVAPICH2), one-sided runtimes (e.g., ComEx), and other PGAS implementations (e.g., OSHMPI)]
• MVAPICH2-EA 2.1 (Energy-Aware)
• A white-box approach
• New Energy-Efficient communication protocols for pt-pt and collective operations
• Intelligently apply the appropriate Energy saving techniques
• Application oblivious energy saving
• OEMT
• A library utility to measure energy consumption for MPI applications
• Works with all MPI runtimes
• PRELOAD option for precompiled applications
• Does not require ROOT permission:
• A safe kernel module to read only a subset of MSRs
Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)
MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)
• An energy-efficient runtime that provides energy savings without application knowledge
• Uses the best energy lever automatically and transparently
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies an energy reduction lever to each MPI call
A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, A Case for Application-Oblivious Energy-Efficient MPI Runtime, Supercomputing ’15, Nov 2015 [Best Student Paper Finalist]
MPI-3 RMA Energy Savings with Proxy-Applications
[Charts: Graph500 execution time (s) and energy usage (J) at 128, 256, and 512 processes, comparing the optimistic, pessimistic, and EAM-RMA runtimes; EAM-RMA saves up to 46% energy]
• MPI_Win_fence dominates application execution time in Graph500
• Between 128 and 512 processes, EAM-RMA yields between 31% and 46% savings with no degradation in execution time in comparison with the default optimistic MPI runtime
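For context, a minimal fence-based RMA epoch of the kind whose MPI_Win_fence wait time dominates here, and where an energy-aware runtime can apply levers while ranks wait; the buffer types, counts, and target rank are illustrative.

#include <mpi.h>

void rma_exchange(long *local, long *remote_copy, int count, int target, MPI_Comm comm)
{
    MPI_Win win;
    MPI_Win_create(local, count * sizeof(long), sizeof(long),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);                       /* open the access/exposure epoch */
    MPI_Get(remote_copy, count, MPI_LONG, target, 0, count, MPI_LONG, win);
    MPI_Win_fence(0, win);                       /* close the epoch; ranks block here */

    MPI_Win_free(&win);
}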
MPI-3 RMA Energy Savings with Proxy-Applications
[Charts: SCF energy usage (J) and execution time (s) at 128, 256, and 512 processes, comparing the optimistic, pessimistic, and EAM-RMA runtimes]
• The SCF (self-consistent field) calculation spends nearly 75% of its total time in the MPI_Win_unlock call
• With 256 and 512 processes, EAM-RMA yields 42% and 36% savings at 11% degradation (close to the permitted degradation ρ = 10%)
• 128 processes is an exception due to the interaction of 2-sided and 1-sided operations
• MPI-3 RMA energy-efficient support will be available in an upcoming MVAPICH2-EA release
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• MPI runtime has many parameters
• Tuning a set of parameters can help you to extract higher performance
• Compiled a list of such contributions through the MVAPICH Website
   – http://mvapich.cse.ohio-state.edu/best_practices/
• Initial list of applications
   – Amber
– HoomdBlue
– HPCG
– Lulesh
– MILC
– MiniAMR
– Neuron
– SMG2000
• Soliciting additional contributions, send your results to mvapich-help at cse.ohio-state.edu. We will link these results with credits to you.
Applications-Level Tuning: Compilation of Best Practices
MVAPICH2 – Plans for Exascale
• Performance and Memory scalability toward 1M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
• Support for task-based parallelism (UPC++)*
• Enhanced Optimization for GPU Support and Accelerators
• Taking advantage of advanced features of Mellanox InfiniBand
   • On-Demand Paging (ODP)
   • Switch-IB2 SHArP
   • GID-based support
• Enhanced Inter-node and Intra-node communication schemes for upcoming architectures
   • OpenPower*
   • OmniPath-PSM2*
   • Knights Landing
• Extended topology-aware collectives
• Extended Energy-aware designs and Virtualization Support
• Extended Support for MPI Tools Interface (as in MPI 3.0)
• Extended Checkpoint-Restart and migration support with SCR
• Support for * features will be available in MVAPICH2-2.2 RC1
• Exascale systems will be constrained by
   – Power
– Memory per core
– Data movement cost
– Faults
• Programming Models and Runtimes for HPC need to be designed for
– Scalability
– Performance
– Fault-resilience
– Energy-awareness
– Programmability
– Productivity
• Highlighted some of the issues and challenges
• Need continuous innovation on all these fronts
Looking into the Future ….
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students
– A. Augustine (M.S.)
– A. Awan (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– N. Islam (Ph.D.)
– M. Li (Ph.D.)
Past Students
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– S. Sur
Current Post-Doc
– J. Lin
– D. Banerjee
Current Programmer
– J. Perkins
Past Post-Docs
– H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekar (Ph.D.)
– K. Kulkarni (M.S.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Research Scientists
– H. Subramoni
– X. Lu
Current Senior Research Associate
– K. Hamidouche
Past Programmers
– D. Bureddy
Current Research Specialist
– M. Arnold
International Workshop on Communication Architectures at Extreme Scale (Exacomm)
ExaComm 2015 was held with the Int’l Supercomputing Conference (ISC ’15), in Frankfurt, Germany, on Thursday, July 16th, 2015
One Keynote Talk: John M. Shalf, CTO, LBL/NERSC
Four Invited Talks: Dror Goldenberg (Mellanox); Martin Schulz (LLNL); Cyriel Minkenberg (IBM-Zurich); Arthur (Barney) Maccabe (ORNL)
Panel: Ron Brightwell (Sandia)
Two Research Papers
ExaComm 2016 will be held in conjunction with ISC ’16
http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html
Technical Paper Submission Deadline: Friday, April 15, 2016
Thank You!
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/